# Using Transformer-based models


The Transformers library (https://huggingface.co/transformers) provides an API for using and training models based on the **_Transformer_** neural network architecture (https://arxiv.org/abs/1706.03762, 2017). These models can address a wide range of NLP tasks, including:


- **Question answering**: given a question and a text, extract an answer from the text


- **Sentiment analysis**: determine whether a text is positive or negative


- **Text generation**: generate text from an initial sequence


- **Named entity recognition** (NER): identify and classify sequences of words that represent an entity (person, place, etc.)


- **Automatic summarization**: generate a summary of a long text


- **Machine translation**: translate a text into another language


- **Filling in missing words**: given a text in which some words have been replaced by [MASK], propose words to fill the gaps


In [None]:
!pip -V
!python -V

In [None]:
#!pip install --user transformers

In [1]:
import transformers
transformers.__version__

'4.12.2'

The library comes with a "hub" of pre-trained models, organized by language and by task: https://huggingface.co/models

The easiest way to use a pre-trained model for an NLP task is the <code>pipeline()</code> function.
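
As a quick orientation, the sketch below collects the task identifiers that this notebook passes to `pipeline()`. The dictionary `PIPELINE_TASKS` and the helper `task_id` are illustrative names introduced here, not part of the library.

```python
# Task identifiers exactly as they are passed to pipeline() in this notebook.
# PIPELINE_TASKS and task_id are hypothetical helper names, not library objects.
PIPELINE_TASKS = {
    "question answering": "question-answering",
    "fill mask": "fill-mask",
    "text generation": "text-generation",
    "summarization": "summarization",
    "translation en-fr": "translation_en_to_fr",
    "sentiment analysis": "sentiment-analysis",
    "named entity recognition": "ner",
}


def task_id(name: str) -> str:
    """Return the pipeline task identifier for a human-readable task name."""
    key = name.strip().lower()
    if key not in PIPELINE_TASKS:
        raise ValueError(f"unknown task: {name!r}")
    return PIPELINE_TASKS[key]


print(task_id("Summarization"))
print(task_id("Named entity recognition"))
```

Each identifier would then be used as the first argument of `pipeline()`, optionally together with a `model` and a `tokenizer`, as in the cells that follow.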



In [2]:
from transformers import pipeline

## 1. Question-Answering

In [None]:
model="distilbert-base-cased-distilled-squad"
nlp = pipeline("question-answering", model=model, tokenizer=model)

### 1.1 Basic example

In [None]:
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
"""


In [None]:
result = nlp(question="What is question answering?", context=context)

In [None]:
result

In [None]:
print(result['answer'])

In [None]:
print(round(result['score'], 4))
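
The result of a question-answering pipeline is a dictionary that, besides `answer` and `score`, is expected to carry `start` and `end` character offsets into the context. The sketch below uses a hand-written dictionary of that shape (the score is made up, not a model output) to show how the offsets relate to the answer. `answer_with_window` is a hypothetical helper.

```python
context = (
    "Extractive Question Answering is the task of extracting an answer "
    "from a text given a question."
)

# Hypothetical result, written by hand to mimic the dictionary shape;
# the score below is NOT a real model output.
answer = "the task of extracting an answer from a text given a question"
start = context.find(answer)
result = {"score": 0.5, "start": start, "end": start + len(answer), "answer": answer}


def answer_with_window(context: str, result: dict, margin: int = 20) -> str:
    """Return the answer span plus up to `margin` characters on each side."""
    lo = max(0, result["start"] - margin)
    hi = min(len(context), result["end"] + margin)
    return context[lo:hi]


# The offsets index directly into the context string.
print(context[result["start"]:result["end"]] == result["answer"])
print(answer_with_window(context, result))
```

Showing a little surrounding text makes it easier to judge whether an extracted answer is sensible.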

### 1.2 Examples combined with Wikipedia

In [None]:
import wikipedia
wikipedia.set_lang("en")

In [None]:
context_wiki = wikipedia.summary(wikipedia.search("Ada Lovelace")[0], sentences=3)
print(context_wiki)

In [None]:
result = nlp(question="Who is Ada Lovelace?", context=context_wiki)
print(result['answer'])

In [None]:
result = nlp(question="What is the profession of Ada Lovelace?", context=context_wiki)
print(result['answer'])

In [None]:
result = nlp(question="When was Ada Lovelace born?", context=context_wiki)
print(result['answer'])

In [None]:
result = nlp(question="What did Ada Lovelace believe?", context=context_wiki)
print(result['answer'])

In [None]:
context_wiki = wikipedia.summary(wikipedia.search("Melinka")[0], auto_suggest=False, sentences=10)
print(context_wiki)

In [None]:
questions = [
    "What is Melinka?",
    "How many people live in Melinka?",
    "Where is Melinka?",
]

for question in questions:
    
    result = nlp(question=question, context=context_wiki)
    print(question)
    print(result['answer'])

### 1.3 Example in Spanish

In [None]:
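from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Rebuild the pipeline so that the questions below are answered by the Spanish model
# instead of the English one created in section 1
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)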

In [None]:
wikipedia.set_lang("es")

context_wiki = wikipedia.summary(wikipedia.search("Valdivia")[0], auto_suggest=False, sentences=4)
print(context_wiki)

In [None]:
questions = [
    "¿En qué año fue fundada Valdivia?",
    "¿Cuál río pasa por Valdivia?",
    "¿Cuántos habitantes viven en Valdivia?",
    "¿A qué distancia de Santiago se encuentra Valdivia?"
]

for question in questions:
    
    # The model and tokenizer are fixed when the pipeline is created, not per call
    result = nlp(question=question, context=context_wiki)
    print(question)
    print(result['answer'])

## 2. Missing word (_fill mask_)

### 2.1 Mask

In [None]:
from transformers import pipeline, AutoModelForMaskedLM, AutoTokenizer

path="dccuchile/bert-base-spanish-wwm-uncased"

tokenizer = AutoTokenizer.from_pretrained(path)

# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for masked-language models
model = AutoModelForMaskedLM.from_pretrained(path)

nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)

In [None]:
from pprint import pprint

sequence = "Para solucionar los problemas de Chile, el presidente debe "\
+ tokenizer.mask_token +\
" de inmediato."

result = nlp(sequence)

pprint(result)

### 2.2 Next word

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


In [None]:
sequence = "Chile, officially the Republic of Chile, is a country in western South"

In [None]:
inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]

# logits at the last position of the sequence, i.e. the scores for the next token
next_token_logits = model(**inputs).logits[:, -1, :]

# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
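
To make the filter-then-sample step above more concrete, here is a minimal pure-Python sketch of top-k filtering followed by softmax sampling on a toy list of logits. It is not the library's implementation: `top_k_filter`, `softmax` and `sample_next_token` are illustrative names, the logits are made up, and top-p filtering is left out.

```python
import math
import random


def top_k_filter(logits, k):
    """Keep the k largest logits and set the others to -inf (ties may keep more)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(logits):
    """Numerically stable softmax; logits equal to -inf get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, k, rng):
    """Sample one token index from the top-k filtered distribution."""
    probs = softmax(top_k_filter(logits, k))
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 0.5, 1.0, -1.0, 3.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)                   # seeded so the sketch is repeatable

probs = softmax(top_k_filter(toy_logits, k=2))
token = sample_next_token(toy_logits, k=2, rng=rng)
print(probs)
print(token)
```

With `k=2` only the two highest-scoring tokens (indices 0 and 4) can be sampled; every other token gets probability zero.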

## 3. Text generation

In [None]:
text_generator = pipeline("text-generation", model="gpt2")


In [None]:
print(text_generator("Chile, officially the Republic of Chile, is a country in western South America. It occupies a long, narrow strip of land between the Andes to the east and the Pacific Ocean to the west. Chile covers an area of", max_length=100, do_sample=False))

In [None]:
text_generator = pipeline("text-generation", model="DeepESP/gpt2-spanish", tokenizer="DeepESP/gpt2-spanish")

In [None]:
print(text_generator("Chile cuenta con un índice de desarrollo humano considerado muy alto y es el más alto de América Latina. Es clasificado como un país", max_length=100, do_sample=False))

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""

prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]

print(generated)

## 4. Automatic summarization

In [None]:
summarizer = pipeline("summarization", model="t5-small")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

In [None]:
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

In [None]:
import wikipedia
wikipedia.set_lang("en")
TEXT = wikipedia.summary(wikipedia.search("Ada Lovelace")[0], sentences=10, auto_suggest=False)
print(len(TEXT))
print(TEXT)

In [None]:
print(summarizer(TEXT, max_length=200, min_length=30, do_sample=False))

## 5. Machine translation

In [None]:
translator = pipeline("translation_en_to_fr")

In [None]:
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

In [None]:
#!pip install mosestokenizer
#!pip install sentencepiece

- Spanish -> English

In [3]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "mrm8488/mbart-large-finetuned-opus-es-en-translation"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [4]:
inputs = tokenizer(
    "A lo largo de la historia de Chile han existido diversos partidos políticos, los que fueron o prohibidoso suspendidos en 1973.",
    return_tensors="pt"
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))

<s> Across the history of Chile, there have been several political parties, which were or banned, suspended in 1973.</s>


- English -> Spanish

## 6. Sentiment analysis

In [5]:
nlp = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [7]:
result = nlp("I hate you")

print(result)

result = nlp("I love you")

print(result)

[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]
[{'label': 'POSITIVE', 'score': 0.9998656511306763}]


In [9]:
text="This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."

result = nlp(text)

print(result)

text2="This is a film which should be seen by anybody interested in, effected by, or suffering from an eating disorder. It is an amazingly accurate and sensitive portrayal of bulimia in a teenage girl, its causes and its symptoms. The girl is played by one of the most brilliant young actresses working in cinema today, Alison Lohman, who was later so spectacular in 'Where the Truth Lies'. I would recommend that this film be shown in all schools, as you will never see a better on this subject. Alison Lohman is absolutely outstanding, and one marvels at her ability to convey the anguish of a girl suffering from this compulsive disorder. If barometers tell us the air pressure, Alison Lohman tells us the emotional pressure with the same degree of accuracy. Her emotional range is so precise, each scene could be measured microscopically for its gradations of trauma, on a scale of rising hysteria and desperation which reaches unbearable intensity. Mare Winningham is the perfect choice to play her mother, and does so with immense sympathy and a range of emotions just as finely tuned as Lohman's. Together, they make a pair of sensitive emotional oscillators vibrating in resonance with one another. This film is really an astonishing achievement, and director Katt Shea should be proud of it. The only reason for not seeing it is if you are not interested in people. But even if you like nature films best, this is after all animal behaviour at the sharp edge. Bulimia is an extreme version of how a tormented soul can destroy her own body in a frenzy of despair. And if we don't sympathise with people suffering from the depths of despair, then we are dead inside."

result = nlp(text2)

print(result)



[{'label': 'NEGATIVE', 'score': 0.999795138835907}]
[{'label': 'POSITIVE', 'score': 0.9994988441467285}]


In [10]:
model = "nlptown/bert-base-multilingual-uncased-sentiment"

nlp = pipeline("sentiment-analysis", model=model, tokenizer=model)

In [12]:
text="Esta historia, en conclusión, es una impresionante obra cinematográfica, que solventa la idea de la imperfección de la perfección, y de la utilidad de la memoria, recomendable para aquel que guste de películas abstractas y que buscan expresar una idea sobre cualquier otra cosa."

result = nlp(text)

print(result)

[{'label': '5 stars', 'score': 0.6283885836601257}]
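
The multilingual model reports a star rating rather than a POSITIVE/NEGATIVE label. The sketch below shows one way to turn such labels into a coarse polarity. It assumes the labels have the form `'1 star'` to `'5 stars'`, and the thresholds (1-2 negative, 3 neutral, 4-5 positive) are an arbitrary choice made for illustration. `stars_to_int` and `polarity` are hypothetical helpers.

```python
def stars_to_int(label: str) -> int:
    """Parse a label such as '1 star' or '5 stars' into an integer."""
    return int(label.split()[0])


def polarity(label: str) -> str:
    """Map a star label to a coarse polarity (thresholds are an assumption)."""
    stars = stars_to_int(label)
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Same structure as the output printed above.
result = [{"label": "5 stars", "score": 0.6283885836601257}]

for r in result:
    print(r["label"], "->", polarity(r["label"]), round(r["score"], 4))
```

Note that the score is the model's confidence in the predicted star class, not a measure of how positive the text is.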


## 7. Named entity recognition

In [None]:
nlp = pipeline("ner")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


In [None]:
TEXT = "The Trump campaign said Wednesday that it will seek a limited recount of two Wisconsin counties. The campaign needs to officially request the recount, and pay an upfront fee, by 5 p.m. CT Wednesday. Wisconsin election officials confirmed on Wednesday that they received a partial payment of $3 million from the Trump campaign. These officials said last week that the price tag for a statewide recount would be approximately $7.9 million."
print(TEXT)

In [None]:
print(nlp(TEXT))
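
The NER pipeline typically returns one dictionary per token, so multi-token entities and word pieces (prefixed with `##`) arrive split up. The sketch below merges neighbouring tokens of the same entity type into a single span. The token list is hand-written for illustration, not real model output; it assumes the keys `entity`, `index` and `word`; and `group_entities` is a hypothetical helper, not a library function.

```python
def group_entities(tokens):
    """Merge consecutive tokens that share an entity type into one span."""
    groups = []
    for tok in tokens:
        prefix, _, kind = tok["entity"].rpartition("-")  # 'I-PER' -> ('I', '-', 'PER')
        word = tok["word"]
        last = groups[-1] if groups else None
        continues = (
            last is not None
            and prefix != "B"                          # a B- tag always opens a new entity
            and last["type"] == kind
            and tok["index"] == last["last_index"] + 1
        )
        if continues:
            # Word pieces are glued to the previous piece, full words get a space.
            last["text"] += word[2:] if word.startswith("##") else " " + word
            last["last_index"] = tok["index"]
        else:
            groups.append({"type": kind, "text": word, "last_index": tok["index"]})
    return [(g["type"], g["text"]) for g in groups]


# Hand-written tokens that mimic the per-token output format (illustrative only).
tokens = [
    {"entity": "I-PER", "index": 2, "word": "Trump"},
    {"entity": "I-LOC", "index": 14, "word": "Wisconsin"},
    {"entity": "I-ORG", "index": 20, "word": "Hu"},
    {"entity": "I-ORG", "index": 21, "word": "##gging"},
    {"entity": "I-ORG", "index": 22, "word": "Face"},
]

print(group_entities(tokens))
```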

## 8. Transformer-based neural network models for solving NLP tasks

- All of the tasks above can be modeled as a "translation" problem:
    - **Input**: a sequence of words
    - **Output**: a sequence of words (possibly one sequence of a single word for classification problems)

- Historically, in NLP, sequence "translation" problems were addressed with recurrent neural network (RNN) models. In 2017, the _Transformer_ architecture improved on RNN architectures by relying on an "attention" mechanism instead of recurrence.
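
The "attention" mechanism can be sketched in a few lines. The pure-Python example below computes scaled dot-product attention, softmax(QKᵀ/√d)·V, as described in "Attention is all you need". It is a toy sketch with made-up vectors, not how any library implements it, and it omits multiple heads, masking and learned projections.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights): one output vector and one weight vector per query.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of the query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up toy vectors: the query is aligned with the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0]]

outputs, weights = attention(queries, keys, values)
print(weights[0])
print(outputs[0])
```

Because the query points in the same direction as the first key, most of the weight goes to the first value vector. Every position can attend to every other position in a single step, rather than passing information along one step at a time as an RNN does.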


<img src="architecture.png" />


### - Article: "Attention is all you need" (2017): https://arxiv.org/abs/1706.03762

### - Talk by Jorge Pérez (DCC - Universidad de Chile, September 2020): https://www.youtube.com/watch?v=4cY1H-QVlZM
