<a href="https://colab.research.google.com/github/hyesunyun/huggingface-lab/blob/main/huggingface_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HuggingFace Lab

*Adapted from https://colab.research.google.com/github/huggingface/notebooks/blob/master/transformers_doc/quicktour.ipynb*

## What is HuggingFace?

HuggingFace is an open-source platform that provides tools for building, training, and deploying machine learning (ML) and natural language processing (NLP) models. It works like a GitHub for AI, serving as a hub for AI developers.

HuggingFace hosts a large collection of models and datasets. You can browse existing models, create your own, and share their weights publicly or privately. You can also find over 30,000 datasets for training or evaluating AI models.

In this lab, we will do a quick tour of using HuggingFace's Transformers & Datasets libraries for different common NLP tasks with pretrained models.

### Install Packages

This is only needed for Google Colab users.

In [None]:
# Transformers installation
! pip install transformers[torch] datasets
# Install dependencies
! pip install torch

### Quick Tour

We will start using the [`pipeline()`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline) for rapid inference, and then quickly load a pretrained model and tokenizer with an [AutoClass](https://huggingface.co/docs/transformers/main/en/model_doc/auto) to solve text tasks.

#### Pipeline

`pipeline()` is the easiest way to use a pretrained model for a given task. It supports many common tasks out-of-the-box:

- Sentiment analysis: classify the polarity of a given text.
- Text generation (in English): generate text from a given input.
- Named entity recognition (NER): label each word with the entity it represents (person, date, location, etc.).
- Question answering: extract the answer from the context, given some context and a question.
- Fill-mask: fill in the blank given a text with masked words.
- Summarization: generate a summary of a long sequence of text or document.
- Translation: translate text into another language.
- Feature extraction: create a tensor representation of the text.

In this example, we will use `pipeline()` for sentiment analysis.

Import and load the pipeline.
The pipeline downloads and caches a default pretrained model and tokenizer for sentiment analysis.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

In [None]:
classifier("We are very happy to show you the HuggingFace's Transformers library.")

You can also classify more than one sentence by passing a list of sentences to the `pipeline()`, which returns a list of dictionaries.

In [None]:
results = classifier(["We are very happy to show you the HuggingFace's Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
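
Because each result is a plain dictionary with `label` and `score` keys, ordinary Python is enough to post-process a batch. The sketch below is a minimal illustration under stated assumptions: `sample_results` contains hand-written placeholder values (not real model output), and `most_confident` and `count_labels` are hypothetical helper names.

```python
# Placeholder results shaped like the classifier output above.
# The labels and scores are made up for illustration, not real model output.
sample_results = [
    {"label": "POSITIVE", "score": 0.75},
    {"label": "NEGATIVE", "score": 0.5},
]


def most_confident(results):
    """Return the result dictionary with the highest score."""
    return max(results, key=lambda r: r["score"])


def count_labels(results):
    """Count how many results carry each label."""
    counts = {}
    for r in results:
        counts[r["label"]] = counts.get(r["label"], 0) + 1
    return counts


best = most_confident(sample_results)
print(best["label"], best["score"])
print(count_labels(sample_results))
```

The same helpers should work on the `results` list returned by `classifier(...)`, since it has the same shape.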

The `pipeline()` can use any model from the Model Hub, which makes it easy to adapt to other tasks and languages.

In this example, the task is translation.

In [None]:
# The task name has the form `translation_xx_to_yy`: change `xx` to the language of the input and `yy` to the language of the desired output.
# Examples: "en" for English, "fr" for French, "de" for German, "es" for Spanish, "zh" for Chinese, etc. For instance, `translation_en_to_fr` translates English to French.
# You can view the list of languages here: https://huggingface.co/languages

# Helsinki-NLP/opus-mt-en-es is the model used for translation from English to Spanish
model_name = "Helsinki-NLP/opus-mt-en-es"
translator = pipeline("translation_en_to_es", model=model_name)

text = "Peanut butter is a food paste or spread made from ground, dry-roasted peanuts."
translator(text)

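Another way to load the pipeline:
Use a task-specific AutoClass such as `AutoModelForSequenceClassification`, together with `AutoTokenizer`, to load the pretrained model and its associated tokenizer (more on AutoClasses below):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Placeholder: replace with a real model ID from the Model Hub.
model_name = "username/model_name"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Use a new variable name so the imported `pipeline` function is not shadowed.
# Placeholder: replace "task name" with the task the model was trained for.
pipe = pipeline("task name", model=model, tokenizer=tokenizer)
pipe("text")
```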

We can also iterate over an entire dataset via HuggingFace's [Datasets](https://huggingface.co/docs/datasets/index) library. We will load [opus-100's en-es test split dataset](https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-es/test).

In [None]:
from datasets import load_dataset

dataset = load_dataset("Helsinki-NLP/opus-100", name="en-es", split="test")

In [None]:
# select first 4 samples from the dataset and format
inputs = [sample["en"] for sample in dataset[:4]["translation"]]
result = translator(inputs)

for d in result:
  print(d["translation_text"])

For a larger dataset where the inputs are big (like in speech or vision), you will want to pass along a generator instead of a list that loads all the inputs in memory. See the pipeline documentation for more information.
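
A generator produces inputs one at a time instead of building the whole list up front. The sketch below shows the idea in plain Python. The helper name `english_texts` and the three sample rows are assumptions made for illustration; the rows only mimic the layout of the opus-100 samples used above.

```python
from itertools import islice


def english_texts(samples):
    """Lazily yield the English side of each translation pair."""
    for sample in samples:
        yield sample["translation"]["en"]


# Stand-in rows with the same layout as the opus-100 en-es samples
# (the sentences themselves are made up for illustration).
rows = [
    {"translation": {"en": "Good morning.", "es": "Buenos días."}},
    {"translation": {"en": "Thank you.", "es": "Gracias."}},
    {"translation": {"en": "See you later.", "es": "Hasta luego."}},
]

stream = english_texts(rows)

# Nothing is read from `rows` until the consumer asks for it.
first_two = list(islice(stream, 2))
print(first_two)

# With the real objects from this notebook, the same generator could be
# handed to the pipeline, which then yields results one at a time:
# for out in translator(english_texts(dataset)):
#     print(out)
```

Only the items that are actually consumed get materialized, which is what keeps memory use flat for large inputs.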

Let's practice with a question answering task, this time with a model that can handle French text. Search the [Model Hub](https://huggingface.co/models) for a model that handles question answering in French. Tip: filter by the `Question Answering` task and the `French` language.
Load a `pipeline()` with the model you chose and run it on this dataset: [manu/fquad2_test](https://huggingface.co/datasets/manu/fquad2_test)

Dataset Details:
- split: test
- pre-processing: the question-answering pipeline takes a question and a context, for example `qa(question=questions, context=contexts)`

In [None]:
##### Add your code below #####

# load pipeline

# load dataset

# sample first 4 rows

# call qa pipeline with questions and contexts

for result in results:
    print(result["score"])
    if result["score"] < 0.01:
        print("La réponse n'est pas dans le contexte fourni.")  # The answer is not in the context provided.
    else:
        print(result["answer"])
    print("----------")

#### AutoClass and AutoTokenizer

Under the hood, `pipeline()` is powered by AutoModels and AutoTokenizers. An [AutoClass](https://huggingface.co/docs/transformers/main/en/model_doc/auto) is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate AutoClass for your task and its associated tokenizer with [AutoTokenizer](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer).
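
One way to picture this lookup is a registry that maps a checkpoint's declared architecture to the class that implements it. The sketch below is a toy model of that idea, not the library's actual code: `MarianToy`, `GPT2Toy`, `ARCHITECTURES`, `CHECKPOINT_CONFIGS`, and `toy_auto_from_pretrained` are all hypothetical names, and the metadata is hard-coded rather than read from the Hub.

```python
class MarianToy:
    """Hypothetical stand-in for a translation architecture."""


class GPT2Toy:
    """Hypothetical stand-in for a text-generation architecture."""


# Hypothetical registry: architecture name -> class implementing it.
ARCHITECTURES = {"marian": MarianToy, "gpt2": GPT2Toy}

# Hypothetical checkpoint metadata, hard-coded here for illustration.
CHECKPOINT_CONFIGS = {
    "Helsinki-NLP/opus-mt-en-es": {"model_type": "marian"},
    "gpt2": {"model_type": "gpt2"},
}


def toy_auto_from_pretrained(name):
    """Look up the checkpoint's architecture and instantiate the matching class."""
    config = CHECKPOINT_CONFIGS[name]
    return ARCHITECTURES[config["model_type"]]()


model = toy_auto_from_pretrained("Helsinki-NLP/opus-mt-en-es")
print(type(model).__name__)
```

The caller only supplies a name; the choice of class follows from the checkpoint's metadata, which is the convenience an AutoClass offers.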

A tokenizer is responsible for preprocessing text into a format that is understandable to the model. First, the tokenizer splits the text into smaller units called tokens (words or subwords). Multiple rules govern the tokenization process, including how to split a word and at what level (see the tokenizer summary in the Transformers documentation to learn more). The most important thing to remember is that you need to instantiate the tokenizer with the same model name as the model, so that you use the same tokenization rules the model was pretrained with.

Let's return to our translation example and see how you can use the AutoClass to replicate the results of the `pipeline()`.

In [None]:
from transformers import AutoTokenizer

# Load tokenizer with the AutoTokenizer
model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = AutoTokenizer.from_pretrained(model_name)

Next, the tokenizer converts the tokens into numbers in order to construct a tensor as input to the model. The mapping from tokens to numbers is known as the model's vocabulary.
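
To make the two steps concrete (split into tokens, then look up IDs), here is a toy sketch in plain Python. It is an illustration only: the vocabulary `toy_vocab` and the helpers `toy_tokenize` and `toy_encode` are hypothetical, and real tokenizers typically use learned subword rules rather than this simple word split.

```python
# Toy vocabulary: every known token maps to an integer ID.
# The tokens and IDs are hypothetical, not the real model vocabulary.
toy_vocab = {
    "<unk>": 0, "peanuts": 1, "are": 2, "a": 3, "good": 4,
    "source": 5, "of": 6, "protein": 7, ".": 8,
}


def toy_tokenize(text):
    """Lowercase the text and split it into word and period tokens."""
    return text.lower().replace(".", " .").split()


def toy_encode(text, vocab):
    """Map each token to its ID, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab["<unk>"]) for token in toy_tokenize(text)]


tokens = toy_tokenize("Peanuts are a good source of protein.")
ids = toy_encode("Peanuts are a good source of protein.", toy_vocab)
print(tokens)
print(ids)
```

A word missing from the vocabulary maps to the `<unk>` ID in this sketch, which is one reason to use the tokenizer that matches the model.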

In [None]:
encoding = tokenizer("Peanuts are a good source of protein.")
print(encoding)

The tokenizer will return a dictionary containing:

- `input_ids`: numerical representations of your tokens.
- `attention_mask`: indicates which tokens should be attended to.

Just like the `pipeline()`, the tokenizer will accept a list of inputs. In addition, the tokenizer can also pad and truncate the text to return a batch with uniform length:

In [None]:
batch = tokenizer(
    ["Is a taco a sandwich?", "I like cilantro with my tacos."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
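
The effect of padding and truncation on `input_ids` and `attention_mask` can be sketched with plain lists. This is a simplified illustration, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the IDs are arbitrary, and `pad_id=0` is an assumed padding ID (the real value depends on the model).

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (target - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different lengths; the second exceeds max_length.
toy_batch = pad_and_truncate([[5, 6, 7], [8, 9, 10, 11, 12, 13]], max_length=4)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

After this step every row has the same length, which is what allows the batch to be stacked into a single tensor.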

Read the [preprocessing tutorial](https://huggingface.co/docs/transformers/main/en/preprocessing) for more details about tokenization.

Transformers provides a simple and unified way to load pretrained instances. This means you can load an AutoModel like you would load an AutoTokenizer. The only difference is selecting the correct AutoModel for the task. Since you are doing translation (a sequence-to-sequence task), load [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM):

In [None]:
from transformers import AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-es"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

See the [task summary](https://huggingface.co/docs/transformers/main/en/task_summary) for which AutoModel class to use for which task.

Now you can pass your preprocessed batch of inputs directly to the model. You just have to unpack the dictionary by adding `**`:

In [None]:
outputs = model.generate(**batch)
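
The `**` operator turns each dictionary key into a keyword argument. The sketch below shows the equivalence with a hypothetical stand-in function, `toy_generate`, so it runs without a model; the batch values are arbitrary.

```python
def toy_generate(input_ids, attention_mask=None, max_new_tokens=5):
    """Hypothetical stand-in for a method that takes keyword arguments."""
    return {
        "n_sequences": len(input_ids),
        "has_mask": attention_mask is not None,
        "max_new_tokens": max_new_tokens,
    }


toy_batch = {
    "input_ids": [[5, 6, 7, 0], [8, 9, 10, 11]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}

# Unpacking the dictionary...
unpacked = toy_generate(**toy_batch)

# ...is equivalent to spelling out every key as a keyword argument.
explicit = toy_generate(
    input_ids=toy_batch["input_ids"],
    attention_mask=toy_batch["attention_mask"],
)

print(unpacked == explicit)
```

Extra keyword arguments can be added after the unpacked dictionary in the same call, for example `toy_generate(**toy_batch, max_new_tokens=10)`.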

The model returns token IDs rather than text. We need to decode them to view the output in natural language:

In [None]:
# decode the outputs
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(decoded)
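
Decoding is the reverse lookup: IDs go back to tokens, and special tokens such as padding are dropped. Here is a toy sketch of that idea. The table `toy_id_to_token`, the set `special_tokens`, and both helper functions are hypothetical, and a real tokenizer also merges subword pieces back into words.

```python
# Hypothetical reverse vocabulary: ID -> token.
toy_id_to_token = {0: "<pad>", 1: "</s>", 2: "hola", 3: "mundo"}
special_tokens = {"<pad>", "</s>"}


def toy_decode(ids, skip_special_tokens=True):
    """Turn a list of IDs back into a string."""
    tokens = [toy_id_to_token[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in special_tokens]
    return " ".join(tokens)


def toy_batch_decode(batch, skip_special_tokens=True):
    """Decode every sequence in a batch."""
    return [toy_decode(ids, skip_special_tokens) for ids in batch]


texts = toy_batch_decode([[2, 3, 1, 0], [2, 1, 0, 0]])
print(texts)
```

With `skip_special_tokens=False`, the end-of-sequence and padding markers would remain in the text, which can be useful when debugging.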

Let's practice with an open-ended text generation task using `AutoModelForCausalLM` and `AutoTokenizer`.

In [None]:
#### Add your code below ####
model_name = "gpt2"

# import the AutoModelForCausalLM and AutoTokenizer

# Load the model and tokenizer

# Define the input text(s)

# Encode the input text(s)

# Generate the output(s)

# Decode the output(s)

print(decoded)

### Appendix

To learn how to improve open-ended language generation with very little effort, read about better decoding methods:

https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb