# Using Transformer-based models


The Transformers library (https://huggingface.co/transformers) provides an API for using and training models based on the **_Transformer_** neural network architecture (https://arxiv.org/abs/1706.03762, 2017). These models can address a wide range of NLP tasks, including:


- **Question answering**: given a question and a text, extract an answer from the text


- **Sentiment analysis**: determine whether a text is positive or negative


- **Text generation**: generate a text from an initial sequence


- **Named entity recognition** (NER): identify and classify word sequences that represent an entity (person, place, etc.)


- **Automatic summarization**: generate a summary from a long text


- **Machine translation**: translate a text into another language


- **Fill-mask (completing text with missing words)**: given a text in which some words have been replaced by [MASK], propose words to fill the gaps


In [None]:
!pip -V
!python -V

In [None]:
#!pip install --upgrade tensorflow
#!pip install --user transformers==2.9.1

In [None]:
import transformers
transformers.__version__

The library comes with a "hub" of pretrained models, organized by language and task: https://huggingface.co/models

The easiest way to use a pretrained model for an NLP task is the <code>pipeline()</code> function.



In [None]:
from transformers import pipeline

## 1. Question-Answering

In [None]:
model="distilbert-base-cased-distilled-squad"
nlp = pipeline("question-answering", model=model, tokenizer=model)

### 1.1 Basic example

In [None]:
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
"""


In [None]:
result = nlp(question="What is question answering?", context=context)

In [None]:
print(result['answer'])

In [None]:
print(round(result['score'], 4))

In [None]:
print(str(result['start']) + " " + str(result['end']))
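
The `start` and `end` values printed above locate the answer inside `context`. The sketch below shows that relationship on a hand-written result dictionary, so it runs without downloading a model. It assumes that both values are character offsets and that `end` is exclusive; some library versions may report an inclusive `end`, in which case the slice needs `end + 1`. The context, answer and score are made up for illustration.

```python
# Hypothetical context and pipeline-style result, written by hand for illustration.
toy_context = "Ada Lovelace was an English mathematician and writer."
toy_answer = "an English mathematician"
toy_start = toy_context.index(toy_answer)

toy_result = {
    "answer": toy_answer,
    "start": toy_start,                  # assumed: character offset of the first character
    "end": toy_start + len(toy_answer),  # assumed: exclusive character offset
    "score": 0.9,                        # made-up confidence value
}

# Slicing the context with the offsets should reproduce the answer text.
span = toy_context[toy_result["start"]:toy_result["end"]]
print(span)
```

If the slice and `result['answer']` disagree by one character on a real result, the installed version most likely uses an inclusive `end`.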

### 1.2 Examples combined with Wikipedia

In [None]:
import wikipedia
wikipedia.set_lang("en")

In [None]:
context_wiki = wikipedia.summary(wikipedia.search("Ada Lovelace")[0], sentences=3)
print(context_wiki)

In [None]:
result = nlp(question="Who is Ada Lovelace?", context=context_wiki)
print(result['answer'])

In [None]:
result = nlp(question="What is the profession of Ada Lovelace?", context=context_wiki)
print(result['answer'])

In [None]:
result = nlp(question="When was Ada Lovelace born?", context=context_wiki)
print(result['answer'])

In [None]:
result = nlp(question="What did Ada Lovelace believe?", context=context_wiki)
print(result['answer'])

In [None]:
context_wiki = wikipedia.summary(wikipedia.search("Chile")[0], auto_suggest=False, sentences=4)
print(context_wiki)

In [None]:
questions = [
    "What is the capital of Chile?",
    "How many people live in Chile?",
    "Where is Chile?",
]

for question in questions:
    # The model and tokenizer already belong to the pipeline; the call only needs the question and context
    result = nlp(question=question, context=context_wiki)
    print(question)
    print(result['answer'])

### 1.3 Example in Spanish

In [None]:
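from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Spanish BERT model fine-tuned for question answering
model_name = "mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Rebuild the pipeline so that the questions below are answered by the Spanish model
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)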

In [None]:
wikipedia.set_lang("es")

context_wiki = wikipedia.summary(wikipedia.search("Valdivia")[0], auto_suggest=False, sentences=4)
print(context_wiki)

In [None]:
questions = [
    "En qué año fue fundada Valdivia?",
    "Cuál río pasa por Valdivia?",
    "Cuántos habitantes viven en Valdivia?",
    "A qué distancia de Santiago se encuentra Valdivia?"
]

for question in questions:
    # The model and tokenizer are fixed when the pipeline is built, not at call time
    result = nlp(question=question, context=context_wiki)
    print(question)
    print(result['answer'])

## 2. Missing word (_fill mask_)

In [None]:
from transformers import pipeline, AutoModelWithLMHead, AutoTokenizer

path="dccuchile/bert-base-spanish-wwm-uncased"

tokenizer = AutoTokenizer.from_pretrained(path)

model = AutoModelWithLMHead.from_pretrained(path)

nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)

In [None]:
from pprint import pprint

sequence = (
    "Para solucionar los problemas de Chile, el presidente debe "
    + tokenizer.mask_token
    + " de inmediato."
)

result = nlp(sequence)

pprint(result)
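
The fill-mask result above is a ranked list of candidate tokens with scores. As a rough illustration of where such a ranking comes from, the sketch below applies a softmax to a handful of made-up scores for the masked position and keeps the most probable tokens. The function `top_k_fill`, the candidate words and their logits are all hypothetical; a real model scores its entire vocabulary.

```python
import math


def top_k_fill(logits, k=3):
    """Turn raw scores into softmax probabilities and return the k best tokens."""
    max_logit = max(logits.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(score - max_logit) for tok, score in logits.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical logits for the masked position (not produced by any real model).
toy_logits = {"actuar": 4.0, "renunciar": 3.0, "hablar": 2.0, "dormir": -1.0}

top = top_k_fill(toy_logits, k=3)
for token, prob in top:
    print(f"{token}: {prob:.3f}")
```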

## 3. Text generation

In [None]:
text_generator = pipeline("text-generation", model="gpt2")

In [None]:
print(text_generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. ", max_length=250, do_sample=True))
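
The `do_sample=True` argument used above asks the generator to draw each next token at random from the model's predicted distribution instead of always taking the most likely one, which is why repeated runs give different texts. The sketch below contrasts the two strategies on a hand-written distribution. The function `pick_next` and the probabilities are hypothetical and are not part of the Transformers API.

```python
import random


def pick_next(probs, do_sample, rng):
    """Choose the next token greedily or by sampling from the distribution."""
    if not do_sample:
        # Greedy decoding: always take the most probable token.
        return max(probs, key=probs.get)
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    # Sampling: draw one token with probability proportional to its weight.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up next-token distribution for illustration.
toy_probs = {"the": 0.5, "a": 0.3, "unicorns": 0.2}
rng = random.Random(0)  # fixed seed so the run is repeatable

greedy = pick_next(toy_probs, do_sample=False, rng=rng)
sampled = [pick_next(toy_probs, do_sample=True, rng=rng) for _ in range(5)]
print(greedy)
print(sampled)
```

Greedy decoding returns the same token every time, while the sampled list can contain any of the three tokens.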

In [None]:
text_generator = pipeline("text-generation", model="mrm8488/GuaPeTe-2-tiny", tokenizer="mrm8488/GuaPeTe-2-tiny")

In [None]:
print(text_generator("Desde ayer, el equipo de fútbol de Chile participa en la copa ", max_length=100, do_sample=True))

## 4. Automatic summarization

In [None]:
summarizer = pipeline("summarization", model="t5-small")

In [None]:
wikipedia.set_lang("en")
TEXT = wikipedia.summary(wikipedia.search("Ada Lovelace")[0], sentences=10, auto_suggest=False)
print(len(TEXT))
print(TEXT)

In [None]:
print(summarizer(TEXT, max_length=200, min_length=30, do_sample=False))

## 5. Machine translation

In [None]:
translator = pipeline("translation_en_to_fr")

In [None]:
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

In [None]:
#!pip install mosestokenizer

- Spanish -> English

In [None]:
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-es-en"

tokenizer = MarianTokenizer.from_pretrained(model_name)

model = MarianMTModel.from_pretrained(model_name)

In [None]:
src_text=["Valdivia es una comuna y ciudad de Chile, capital de la provincia homónima y de la Región de Los Ríos. Se encuentra a 847,6 km al sur de Santiago, la capital de Chile."]

In [None]:
translated = model.generate(**tokenizer.prepare_translation_batch(src_text))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

In [None]:
print(tgt_text)

- English -> Spanish

In [None]:
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"

tokenizer = MarianTokenizer.from_pretrained(model_name)

model = MarianMTModel.from_pretrained(model_name)

In [None]:
src_text=["Valdivia is a municipality and city of Chile, capital of the province of the same name and of the Los Ríos Region. It is located 847.6 km south of Santiago, the capital of Chile."]

In [None]:
translated = model.generate(**tokenizer.prepare_translation_batch(src_text))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

print(tgt_text)

## 6. Sentiment analysis

In [None]:
nlp = pipeline("sentiment-analysis")

In [None]:
result = nlp("I hate you")

pprint(result)

result = nlp("I love you")

pprint(result)

In [None]:
text="This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."

result = nlp(text)

pprint(result)

text2="This is a film which should be seen by anybody interested in, effected by, or suffering from an eating disorder. It is an amazingly accurate and sensitive portrayal of bulimia in a teenage girl, its causes and its symptoms. The girl is played by one of the most brilliant young actresses working in cinema today, Alison Lohman, who was later so spectacular in 'Where the Truth Lies'. I would recommend that this film be shown in all schools, as you will never see a better on this subject. Alison Lohman is absolutely outstanding, and one marvels at her ability to convey the anguish of a girl suffering from this compulsive disorder. If barometers tell us the air pressure, Alison Lohman tells us the emotional pressure with the same degree of accuracy. Her emotional range is so precise, each scene could be measured microscopically for its gradations of trauma, on a scale of rising hysteria and desperation which reaches unbearable intensity. Mare Winningham is the perfect choice to play her mother, and does so with immense sympathy and a range of emotions just as finely tuned as Lohman's. Together, they make a pair of sensitive emotional oscillators vibrating in resonance with one another. This film is really an astonishing achievement, and director Katt Shea should be proud of it. The only reason for not seeing it is if you are not interested in people. But even if you like nature films best, this is after all animal behaviour at the sharp edge. Bulimia is an extreme version of how a tormented soul can destroy her own body in a frenzy of despair. And if we don't sympathise with people suffering from the depths of despair, then we are dead inside."

result = nlp(text2)

pprint(result)



In [None]:
model = "nlptown/bert-base-multilingual-uncased-sentiment"

nlp = pipeline("sentiment-analysis", model=model, tokenizer=model)

In [None]:
text="Esta historia, en conclusión, es una impresionante obra cinematográfica, que solventa la idea de la imperfección de la perfección, y de la utilidad de la memoria, recomendable para aquel que guste de películas abstractas y que buscan expresar una idea sobre cualquier otra cosa."

result = nlp(text)

pprint(result)
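
Unlike the default sentiment pipeline, this multilingual model rates the text on a star scale rather than returning a positive/negative label. The sketch below maps such a rating back to a coarse polarity. It assumes that labels look like `"1 star"` to `"5 stars"`; the helper `stars_to_polarity`, the thresholds and the example result are hypothetical choices made for illustration.

```python
def stars_to_polarity(label):
    """Map a star rating label such as '4 stars' to a coarse polarity."""
    stars = int(label.split()[0])  # assumed label format: '<number> star(s)'
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Hand-written example in the shape of a pipeline result (not real model output).
toy_sentiment = [{"label": "5 stars", "score": 0.62}]

for item in toy_sentiment:
    print(item["label"], "->", stars_to_polarity(item["label"]))
```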

## 7. Named entity recognition

In [None]:
nlp = pipeline("ner")

In [None]:
TEXT = "The Trump campaign said Wednesday that it will seek a limited recount of two Wisconsin counties. The campaign needs to officially request the recount, and pay an upfront fee, by 5 p.m. CT Wednesday. Wisconsin election officials confirmed on Wednesday that they received a partial payment of $3 million from the Trump campaign. These officials said last week that the price tag for a statewide recount would be approximately $7.9 million."
print(TEXT)

In [None]:
pprint(nlp(TEXT))
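
The NER pipeline reports one entry per token, so a name split into several word pieces appears as several entries. The sketch below merges neighbouring entries into readable entity spans. It assumes each entry has `word`, `entity` and `index` keys and that continuation pieces start with `##`; the exact keys depend on the installed version. The helper `group_entities` and the token list are hypothetical and written by hand.

```python
def group_entities(tokens):
    """Merge adjacent tokens into (entity_label, text) spans."""
    groups = []
    for tok in tokens:
        last = groups[-1] if groups else None
        adjacent = last is not None and tok["index"] == last["last_index"] + 1
        if adjacent and tok["word"].startswith("##"):
            # Continuation word piece: glue it to the previous token without a space.
            last["text"] += tok["word"][2:]
            last["last_index"] = tok["index"]
        elif adjacent and tok["entity"] == last["entity"]:
            # Next word of the same entity: join with a space.
            last["text"] += " " + tok["word"]
            last["last_index"] = tok["index"]
        else:
            groups.append(
                {"entity": tok["entity"], "text": tok["word"], "last_index": tok["index"]}
            )
    return [(g["entity"], g["text"]) for g in groups]


# Hand-written token entries for illustration (not real model output).
toy_tokens = [
    {"word": "Trump", "entity": "I-ORG", "index": 2},
    {"word": "campaign", "entity": "I-ORG", "index": 3},
    {"word": "Wi", "entity": "I-LOC", "index": 14},
    {"word": "##sconsin", "entity": "I-LOC", "index": 15},
    {"word": "CT", "entity": "I-MISC", "index": 40},
]

entities = group_entities(toy_tokens)
print(entities)
```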

## 8. Transformer-based neural network models for NLP tasks

- All of the tasks above can be modeled as a "translation" problem:
    - **Input**: a sequence of words
    - **Output**: a sequence of words (possibly one sequence of a single word, for classification problems)

- Historically, sequence "translation" problems in NLP were addressed with recurrent neural network (RNN) models. In 2017, the _Transformer_ architecture improved on RNN-based models by replacing recurrence with an "attention" mechanism.
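
To make the "attention" mechanism more concrete, here is a minimal sketch of scaled dot-product attention in plain Python: each query is compared with every key, the scores are normalized with a softmax, and the output is the resulting weighted average of the values. The function names and the toy vectors are illustrative assumptions. A real Transformer adds learned projections, multiple heads and batched matrix operations.

```python
import math


def softmax(scores):
    """Normalize a list of scores into probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors used as queries, keys and values (self-attention).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(toy_vectors, toy_vectors, toy_vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence, and each row sums to 1.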


<img src="architecture.png" />


### - Paper: "Attention Is All You Need" (2017): https://arxiv.org/abs/1706.03762

### - Talk by Jorge Pérez (DCC - Universidad de Chile, September 2020): https://www.youtube.com/watch?v=4cY1H-QVlZM
