# Introduction to the 🤗 Transformers library
This notebook demonstrates the tasks that can be performed with the 🤗 *transformers* library from [Hugging Face](https://huggingface.co).

In [None]:
# install the library
!pip install transformers[sentencepiece]

## Running tasks with `pipeline`
The most direct way to run a task with a pre-trained Hugging Face transformer model is through a `pipeline`. The Transformers library provides pre-trained pipelines for:
- Sentiment analysis: is a text positive or negative?
- Text generation (in English): provide a prompt and the model will generate what follows.
- Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.)
- Question answering: provide the model with some context and a question, extract the answer from the context.
- Filling masked text: given a text with masked words (e.g., replaced by `[MASK]`), fill the blanks.
- Summarization: generate a summary of a long text.
- Translation: translate a text into another language.
- Feature extraction: return a tensor representation of the text.

First we import the `pipeline` function so that we can use it:


In [None]:
from transformers import pipeline

### Sentiment analysis

In [None]:
classifier = pipeline('sentiment-analysis')

Once the pipeline is instantiated, using it is straightforward:

In [None]:
classifier('We are very happy to show you the 🤗 Transformers library.')

In [None]:
classifier.model

In [None]:
classifier.model.config.id2label
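
The `id2label` mapping is what turns the model's raw class scores into a readable label. The following is a minimal sketch of that last step, not the library's actual implementation: the mapping and the logits are made-up values chosen for illustration, and `softmax` and `to_prediction` are hypothetical helper names.

```python
import math

# Assumed mapping, in the same shape as model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, id2label):
    """Pick the most probable class and report it as a label/score pair."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Made-up logits for one sentence; a real model would compute these.
prediction = to_prediction([-2.0, 3.0], id2label)
print(prediction)
```

The result has the same `label` and `score` keys as the dictionaries returned by the classifier above.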

We can choose any pre-trained model from the Hugging Face [model hub](https://huggingface.co/models). For example, the model `"nlptown/bert-base-multilingual-uncased-sentiment"` is pre-trained on several languages, including Spanish.

In [None]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

In [None]:
classifier('We are very happy to show you the 🤗 Transformers library.')

In [None]:
classifier.model

In [None]:
classifier('Me encanta el helado de vainilla')

In [None]:
classifier('I hate chocolate ice cream')

In [None]:
import pandas as pd

outputs = classifier(['Odio el helado de chocolate', 'Me encanta el helado de vainilla'])
pd.DataFrame(outputs)

### Zero-shot classification
With this task we can classify a text against labels chosen at inference time, without needing a labeled training set for those labels.

In [None]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["business", "education", "sports"],
)
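
Zero-shot classification pipelines are typically built on a natural language inference (NLI) model: each candidate label is turned into a hypothesis sentence, and the entailment scores are normalized across the labels. The sketch below shows only that bookkeeping. The entailment logits are made-up numbers, and `build_hypotheses` and `rank_labels` are hypothetical helper names, not library functions.

```python
import math

def build_hypotheses(candidate_labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in candidate_labels]

def rank_labels(candidate_labels, entailment_logits):
    """Softmax the entailment logits across labels and sort by score."""
    top = max(entailment_logits)
    exps = [math.exp(x - top) for x in entailment_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
    return {
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }

labels = ["business", "education", "sports"]
hypotheses = build_hypotheses(labels)
# Made-up entailment logits, one per hypothesis; a real NLI model would compute them.
ranking = rank_labels(labels, [0.5, 3.0, -1.0])
print(hypotheses)
print(ranking)
```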

### Text generation
Using a generative (auto-regressive) model, we can generate text from a seed prompt.

In [None]:
generator = pipeline("text-generation")
generator("In this tutorial, we will teach you how to")

In [None]:
generator.model

In [None]:
output = generator("In this tutorial, we will teach you how to", num_return_sequences=2)
print(output[0]['generated_text'])
print(output[1]['generated_text'])
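
Auto-regressive generation means producing one token at a time, each choice depending on what came before. The toy sketch below illustrates the loop with a hand-written next-token table instead of a neural model. The table, the `generate` helper, and the simplification of looking only at the last token are all assumptions made for illustration.

```python
import random

# Toy next-token table standing in for a language model (hypothetical data).
NEXT_TOKENS = {
    "we": ["will"],
    "will": ["teach"],
    "teach": ["you"],
    "you": ["how", "to"],
    "how": ["to"],
    "to": ["train", "use"],
}

def generate(prompt_tokens, max_new_tokens, rng):
    """Append one sampled token at a time, conditioned on the last token so far."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKENS.get(tokens[-1])
        if not candidates:  # no known continuation: stop early
            break
        tokens.append(rng.choice(candidates))
    return tokens

rng = random.Random(0)  # fixed seed so the run is repeatable
sequence = generate(["we", "will"], max_new_tokens=3, rng=rng)
print(" ".join(sequence))
```

Because the next token is sampled, different seeds can give different continuations, which is also why `num_return_sequences=2` can return two different texts.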

In [None]:
generator = pipeline("text-generation", model="mrm8488/spanish-gpt2")
generator("Me llamo Joan y me gusta")

### Mask filling
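This task consists of filling in the blanks in a sentence. It is the objective used to pre-train masked language models such as BERT. The mask token depends on the model: the default model used below expects `<mask>`, while BERT models expect `[MASK]`.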

In [None]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

In [None]:
unmasker("I went to a japanese <mask> to eat some <mask> with cheese.", top_k=1)

### Named Entity Recognition
In this task, each *token* is labeled according to the entity it belongs to.

In [None]:
ner = pipeline("ner", aggregation_strategy="simple")
outputs = ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
pd.DataFrame(outputs)

In [None]:
# with aggregation_strategy="none" (the default), each token gets its own label in a B-I-O scheme
ner = pipeline("ner", aggregation_strategy="none")
outputs = ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
pd.DataFrame(outputs)
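
The difference between the two outputs above is a grouping step: consecutive tokens tagged as part of the same entity are merged into one span. The sketch below shows that idea on whole words with hand-written tags. It is a simplification under stated assumptions: the real pipeline works on sub-word tokens and character offsets, and `group_entities` is a hypothetical helper, not a library function.

```python
def group_entities(tokens, tags):
    """Merge consecutive B-/I- tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_words.append(token)  # continue the open span
            continue
        if current_words:  # close the span that was open
            entities.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):  # start a new span (tolerate a stray I- tag)
            current_type, current_words = entity_type, [token]
        else:  # "O": outside any entity
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities

# Hand-written tags for illustration, not real model output.
tokens = ["Sylvain", "works", "at", "Hugging", "Face", "in", "Brooklyn"]
tags = ["B-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
entities = group_entities(tokens, tags)
print(entities)
```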

In [None]:
ner.model

### Question answering
This task consists of answering a question by extracting the answer from a given context.

In [None]:
question_answerer = pipeline("question-answering")
context = r"""
Joan lives in New York. His friend Antonio lives in Brussels.
"""
question_answerer(
    question="Where does Joan live?",
    context=context
)

In [None]:
context[15:23]
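
The slice `context[15:23]` works because an extractive question-answering result reports the answer as character offsets (`start` and `end`) into the context. The sketch below reproduces those offsets with plain string search on the same context. `locate_answer` is a hypothetical helper for illustration and does no model inference.

```python
# Same text as the context above, including the leading and trailing newlines.
context = "\nJoan lives in New York. His friend Antonio lives in Brussels.\n"

def locate_answer(context, answer):
    """Return the answer with the character offsets where it occurs in the context."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer text not found in context")
    end = start + len(answer)  # end is exclusive, as in Python slicing
    return {"answer": context[start:end], "start": start, "end": end}

span = locate_answer(context, "New York")
print(span)
```

Note that the leading newline counts as character 0, which is why the answer starts at offset 15 rather than 14.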

In [None]:
question_answerer.model

### Summarization
This task consists of generating a short (abstractive) summary of a text.

In [None]:
summarizer = pipeline("summarization")
summarizer("""
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
""")

### Translation
We can use the default model by specifying the language pair in the task name (for example, `translation_en_to_fr`), or use a specific model from the [model hub](https://huggingface.co/models).

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")
translator("Me llamo Joan y soy profesor de universidad.")

In [None]:
translator.model

### *Feature extraction*
The model returns the vector representation (embeddings) from the last layer for each token (`last_hidden_state`).

In [None]:
extractor = pipeline(model="bert-base-uncased", task="feature-extraction")
sentence = "the BERT tokenizer was created with a WordPiece model."
result = extractor(sentence, return_tensors=True)
result.shape  # Tensor of shape [1, sequence_length, hidden_dimension] representing the input string.

In [None]:
result = extractor(sentence, return_tensors=False)
type(result)

In [None]:
len(result[0])

In [None]:
len(result[0][0])
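
With `return_tensors=False` the features come back as nested Python lists, so the dimensions have to be read off with `len` at each level. A small sketch of that idea follows, using a stand-in nested list rather than real model output. `nested_shape` is a hypothetical helper and assumes the nesting is regular.

```python
def nested_shape(value):
    """Return the dimensions of a regularly nested list, outermost first."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:  # empty list: nothing further to inspect
            break
        value = value[0]
    return shape

# Stand-in for pipeline output: 1 sentence, 3 tokens, 4-dimensional vectors.
fake_features = [[[0.0] * 4 for _ in range(3)]]
print(nested_shape(fake_features))
```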

In [None]:
extractor.model

The sequence length is given by the number of tokens, not words, plus the special tokens `[CLS]` and `[SEP]`.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
output = tokenizer(sentence)
print(output)

In [None]:
print(len(output.input_ids))  # number of tokens
print(len(sentence.split()))  # number of words

In [None]:
print(tokenizer.convert_ids_to_tokens(output.input_ids))

In [None]:
print(sentence.split())
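
The token count exceeds the word count because WordPiece splits words that are not in the vocabulary into smaller pieces, marking continuations with `##`. The sketch below implements the greedy longest-match-first idea on a tiny, hypothetical vocabulary. It is an illustration of the technique, not the tokenizer's actual code, and it skips normalization and punctuation handling.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:  # no piece fits: the whole word becomes unknown
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Tiny hypothetical vocabulary; the real bert-base-uncased vocabulary is far larger.
vocab = {"the", "token", "##izer", "was", "created"}
words = "the tokenizer was created".split()
tokens = ["[CLS]"] + [p for w in words for p in wordpiece(w, vocab)] + ["[SEP]"]
print(tokens)
print(len(tokens), len(words))
```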

## Model bias
The language models behind *transformers* are trained on large amounts of unlabeled text, mostly collected from the Internet. As a result, they can exhibit biases (racism, gender bias, etc.).

In [None]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])