# Hugging Face Transformers

The ``transformers`` library from Hugging Face (http://huggingface.co) provides a large collection of pre-trained transformers for a variety of NLP tasks.

Since our goal at the moment is to quickly get started with familiar NLP tasks, we will use these pretrained models for some common use cases. An excellent place to get started with them is the task summary:

https://huggingface.co/transformers/task_summary.html

The following list from the Hugging Face website enumerates the tasks for which pre-trained transformers are available:

* Sentiment analysis: is a text positive or negative?

* Text generation (in English): provide a prompt and the model will generate what follows.

* Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.).

* Question answering: provide the model with some context and a question, extract the answer from the context.

* Filling masked text: given a text with masked words (e.g., replaced by [MASK]), fill the blanks.

* Summarization: generate a summary of a long text.

* Translation: translate a text into another language.

* Feature extraction: return a tensor representation of the text.





## pipeline()

Perhaps the quickest way to get started with the pretrained transformer models in this library is to use the ``pipeline`` function. All we have to do is pass, as an argument, the name of a task for which a pretrained transformer exists.

## Sentiment Analysis
Consider the case where we would like to perform a simple sentiment analysis on some text.


In [65]:
from transformers import pipeline
# device=0 places the model on the first GPU; omit it to run on the CPU.
sentimental = pipeline("sentiment-analysis", device=0)
sentimental('Some of us love the Deep Learning workshops at SupportVectors!')

[{'label': 'POSITIVE', 'score': 0.9996930956840515}]

### Passing a batch of text for sentiment analysis
In the above case, we passed a single piece of text. What if we had multiple pieces of text? These can be passed as a list.

If you observe the results, you will notice that sentiment analysis in NLP remains a work in progress. Even transformer architectures do not get all of them right: arguably neutral sentences are still labeled POSITIVE or NEGATIVE, often with high confidence. This is not far from the state-of-the-art performance at the moment, so there remains a lot of scope for improvement.


In [67]:
import textwrap

text_list = [
    'Some of us love the deep learning workshops at SupportVectors',
    'Today, the sky is shining, and the birds are flying',
    'All is well that ends well',
    'The autumn leaves wafting in the gentle breeze',
    'It was a dark and gloomy night',
    'Good grief! the turmoil of elections are upon us again!',
    'The train slowed down as it neared the station'
]

results = sentimental(text_list)

for result, text in zip(results, text_list):
    print(f"{result['label']:>15s} ({round(result['score'], 5):>5}) <== {textwrap.shorten(text, 60)}")

       POSITIVE (0.99764) <== Some of us love the being fired at our jobs
       POSITIVE (0.99978) <== Today, the sky is shining, and the birds are flying
       POSITIVE (0.99984) <== All is well that ends well
       POSITIVE (0.99952) <== The autumn leaves wafting in the gentle breeze
       NEGATIVE (0.98853) <== It was a dark and gloomy night
       POSITIVE (0.98478) <== Good grief! the turmoil of elections are upon us again!
       NEGATIVE (0.97952) <== The train slowed down as it neared the station
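
The output above contains only two labels, so a sentence with no obvious sentiment still ends up as POSITIVE or NEGATIVE. One simple workaround is to treat low-scoring predictions as undecided. The following is a minimal sketch of that idea: `triage` is a hypothetical helper, the 0.99 threshold is an arbitrary value chosen for illustration, and the result dictionaries are copied (with rounded scores) from the output above.

```python
from collections import Counter

# Labels and rounded scores copied from the batch output above.
results = [
    {"label": "POSITIVE", "score": 0.99764},
    {"label": "POSITIVE", "score": 0.99978},
    {"label": "POSITIVE", "score": 0.99984},
    {"label": "POSITIVE", "score": 0.99952},
    {"label": "NEGATIVE", "score": 0.98853},
    {"label": "POSITIVE", "score": 0.98478},
    {"label": "NEGATIVE", "score": 0.97952},
]


def triage(results, threshold=0.99):
    """Keep the predicted label only when its score reaches the threshold.

    The threshold is an illustrative assumption, not a recommended setting.
    """
    return [r["label"] if r["score"] >= threshold else "UNSURE" for r in results]


labels = triage(results)
print(Counter(labels))
```

Note that the scores are high even for the sentences a human would call neutral, so a threshold of this kind catches only some of the doubtful cases.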


### Yelp Review Samples

In [7]:
from typing import List
import textwrap
from transformers import pipeline
import pandas as pd
import re

sentimental = pipeline("sentiment-analysis")
YELP_REVIEWS = '../../text/smaller_yelp_review.json'

index = 0
sentiments = pd.DataFrame(columns=['label', 'score', 'review'])

with open(YELP_REVIEWS) as f:
    for line in f:
        # Capture the value of the text field directly; a capture group avoids
        # str.lstrip, which strips a set of characters rather than a prefix.
        # Note: the match stops at the first double quote, even an escaped one.
        reviews: List[str] = re.findall(r'text":"(.*?)"', line)
        if reviews:
            review = reviews[0]
            result: dict = sentimental(review)[0]
            sentiments.loc[index] = {'label': result['label'],
                                     'score': result['score'], 'review': review}
            index += 1
print(f'Computed sentiments on {index} reviews!')

Computed sentiments on 100 reviews!
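
The regular expression above stops at the first double quote it meets, so a review that contains an escaped quote is cut short. If each line of the file is a complete JSON object with a `text` field (an assumption suggested by the pattern being matched, not something verified here), the standard `json` module can decode the field in full. A minimal sketch, using two made-up records and a hypothetical helper named `extract_reviews`:

```python
import json


def extract_reviews(lines):
    """Yield the `text` field of each JSON line, skipping blank or invalid lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole run
        if isinstance(record, dict) and isinstance(record.get("text"), str):
            yield record["text"]


# Two made-up records standing in for lines of the review file.
sample_lines = [
    '{"stars": 4, "text": "What agents would call \\"cozy\\".\\n\\nWorth a visit."}',
    '{"stars": 1, "text": "Dismal food."}',
]
reviews = list(extract_reviews(sample_lines))
print(reviews[0])
```

Decoding with `json.loads` also turns escape sequences such as `\n` into real newlines, which the regular expression leaves as literal characters.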


In [70]:
for index, review in enumerate(sentiments['review'], start=1):
    print(f'\n\n {index}: {review} END')



 1: As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the Virginia Museum of Fine Arts (VMFA), I knew I had to go!\n\nTucked away near the gelateria and the garden, the Gallery is pretty much hidden from view. It's what real estate agents would call \ END


 2: I am actually horrified this place is still in business. My 3 year old son needed a haircut this past summer and the lure of the $7 kids cut signs got me in the door. We had to wait a few minutes as both stylists were working on people. The decor in this place is total garbage. It is so tacky. The sofa they had at the time was a pleather sofa with giant holes in it. And my son noticed ants crawling all over the floor and the furniture. It was disgusting and I should have walked out then. Actually, I should have turned around and walked out upon entering but I didn't. So the older black m

In [71]:
sentiments

| | label | score | review |
|---|---|---|---|
| 0 | NEGATIVE | 0.943437 | As someone who has worked with many museums, I... |
| 1 | NEGATIVE | 0.999730 | I am actually horrified this place is still in... |
| 2 | POSITIVE | 0.999817 | I love Deagan's. I do. I really do. The atmosp... |
| 3 | NEGATIVE | 0.999594 | Dismal, lukewarm, defrosted-tasting \ |
| 4 | POSITIVE | 0.964266 | Oh happy day, finally have a Canes near my cas... |
| ... | ... | ... | ... |
| 95 | POSITIVE | 0.999744 | The whole experience is awesome. They start yo... |
| 96 | POSITIVE | 0.731915 | The reviews for this place on Facebook are gre... |
| 97 | POSITIVE | 0.999792 | Amazing food and perfect location for any Luke... |
| 98 | NEGATIVE | 0.996666 | 3 stars for food, but the service was awful. A... |


### ¿Hablas español? (Do you speak Spanish?)

In [8]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentimental = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

result = sentimental('¡Qué maravillosa experiencia!')
print(result)

[{'label': '5 stars', 'score': 0.9345182776451111}]
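
Unlike the earlier model, this one answers with a star rating rather than POSITIVE or NEGATIVE. If we want to compare the two, one option is to collapse the stars into a polarity. The sketch below assumes that labels always look like `'5 stars'` (a number followed by a word, as in the output above); the cut-offs and the helper name `stars_to_polarity` are illustrative choices, not part of the library.

```python
def stars_to_polarity(label: str) -> str:
    """Map a label such as '5 stars' to NEGATIVE, NEUTRAL or POSITIVE.

    The cut-offs (1-2 negative, 3 neutral, 4-5 positive) are an assumption
    made for illustration; the model itself defines no such mapping.
    """
    stars = int(label.split()[0])
    if stars <= 2:
        return "NEGATIVE"
    if stars == 3:
        return "NEUTRAL"
    return "POSITIVE"


result = [{'label': '5 stars', 'score': 0.9345182776451111}]  # output shown above
print(stars_to_polarity(result[0]['label']))
```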


## Understanding the inner details

Let us see what the code in the examples above did for us. The following is adapted from the Hugging Face tutorial.

The ``AutoTokenizer`` class automatically downloads the tokenizer associated with the model we pick. The ``AutoModelForSequenceClassification`` class downloads the relevant model by name.

In [9]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Now that we have the model and the tokenizer, we need a pipeline that first tokenizes the text and then feeds it into the model. Before we do that, let us inspect the tokenization step:





In [25]:
tokens_tensor = tokenizer(
    ["Some of us love the deep learning workshop at SupportVectors", "Horrible food!"],
    padding=True,    
    truncation=True,
    return_tensors="pt" )


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
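
The two sentences in the batch have different lengths, and `padding=True` asks the tokenizer to pad the shorter one so that both fit in a single tensor. Alongside the token ids, the tokenizer returns an attention mask that marks which positions hold real tokens. The following toy sketch shows the idea in plain Python; the token ids are made up, and `pad_batch` and the padding id of 0 are assumptions for illustration, not the tokenizer's actual implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids; a real tokenizer would look them up in its vocabulary.
batch = pad_batch([[101, 7, 8, 9, 102], [101, 5, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The mask lets the model ignore the padded positions, so padding does not change the prediction for the shorter sentence.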


Let us now feed this into the model to get a result.

In [31]:
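# Note: `model` is the five-class multilingual model loaded earlier
# (nlptown/bert-base-multilingual-uncased-sentiment), which is why each row of
# the output below has five probabilities. To score with the SST-2 model that
# matches `tokenizer`, call `pt_model` instead (two classes).
result = model(**tokens_tensor, output_hidden_states=True, output_attentions=True)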

In [32]:
# The model returns raw scores (logits); softmax converts them to probabilities.
import torch.nn.functional as F
pt_predictions = F.softmax(result[0], dim=-1)
pt_predictions

tensor([[0.6308, 0.2084, 0.0993, 0.0318, 0.0297],
        [0.0702, 0.0877, 0.2197, 0.2991, 0.3233]], grad_fn=<SoftmaxBackward>)
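
To see what the softmax step does, here is a small worked example in plain Python. It is a sketch of the standard softmax formula, not of the PyTorch implementation, and the two logits are made up rather than taken from the run above.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a two-class model.
probs = softmax([-2.0, 2.0])
print([round(p, 4) for p in probs])
```

The larger logit receives the larger probability, and the values in each row add up to one, just as in the tensor printed above.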

### Hidden states and attentions

Of course, we can also look inside the forward pass. Because we requested them, the last two entries of the result are the hidden states of each layer and the attention weights:

In [35]:
hidden_states, attentions = result[-2:]

## Question Answering

Let us see how we can make a transformer answer questions about a piece of text we take from our course portal!

### Text

In this workshop, as an optional activity, there is the reading of some research papers from Arxiv. While it may appear intimidating, these papers are considered important readings if you want to be considered an expert. All papers broadly follow the IMRC-format (Introduction, Method, Results, Conclusion). The easiest way to start reading a paper is to first read the abstract, skim over the introduction, and short-circuit straight to the conclusions. Once you have gotten a general sense of the lay of the land, now carefully read the introduction -- preferably with a highlighter and pencil in hand. When you reach the method section, understand as much as you can on the first careful reading. It may take a few study iterations before it becomes fully comprehensible: so do not be daunted if at first study it feels intimidating. In due course of time, it becomes familiar and easy. If the paper really interests you deeply, see if you reproduce the results of the paper independently, or check out the python/PyTorch implementations of the ideas. Research is an open community, and implementations of an idea get shared very quickly in the open-source domain.

### Question

What is the easiest way to reading a paper?

In [40]:
# Let us start with the data

text = r"In this workshop, as an optional activity, there is the reading of some research papers from Arxiv. While it may appear intimidating, these papers are considered important readings if you want to be considered an expert. All papers broadly follow the IMRC-format (Introduction, Method, Results, Conclusion). The easiest way to start reading a paper is to first read the abstract, skim over the introduction, and short-circuit straight to the conclusions. Once you have gotten a general sense of the lay of the land, now carefully read the introduction -- preferably with a highlighter and pencil in hand. When you reach the method section, understand as much as you can on the first careful reading. It may take a few study iterations before it becomes fully comprehensible: so do not be daunted if at first study it feels intimidating. In due course of time, it becomes familiar and easy. If the paper really interests you deeply, see if you reproduce the results of the paper independently, or check out the python/PyTorch implementations of the ideas. Research is an open community, and implementations of an idea get shared very quickly in the open-source domain."

question_1 = r'What is the easiest way to reading a paper?'
question_2 = r'What happens in due course of time?'

In [41]:
qa = pipeline("question-answering")
result = qa(question=question_1, context=text)
result



{'score': 0.04903706908226013,
 'start': 355,
 'end': 407,
 'answer': 'first read the abstract, skim over the introduction,'}

In [42]:
qa(question=question_2, context=text)



{'score': 0.8247432112693787,
 'start': 860,
 'end': 889,
 'answer': 'it becomes familiar and easy.'}
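
In both results above, `end - start` equals the length of the answer (52 and 29 characters), which suggests that `start` and `end` are character offsets into the context, with `end` exclusive. The sketch below illustrates that reading on a short made-up context; the result dictionary is built by hand in the same shape as the pipeline output, and `answer_from_offsets` is a hypothetical helper.

```python
def answer_from_offsets(context: str, result: dict) -> str:
    """Recover the answer by slicing the context with the returned offsets."""
    return context[result["start"]:result["end"]]


# A short made-up context and a hand-built result shaped like the output above.
context = "Read the abstract first. In due course of time, it becomes familiar and easy."
answer = "it becomes familiar and easy."
start = context.index(answer)
result = {"score": 0.82, "start": start, "end": start + len(answer), "answer": answer}

print(answer_from_offsets(context, result) == result["answer"])
```

The offsets make it easy to highlight the answer in the original text. The score deserves a look too: the first answer above comes with a score of only about 0.05, so the model is far less sure of it than of the second.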

## Fill in the blanks: transformers to the rescue!

One can train a transformer as a masked language model, which can then fill in the blanks for you. Consider the following context:

Some of us ??? the deep learning workshop training at SupportVectors very much!



In [49]:
# Let's do a fill-in-the-blank exercise!
completer = pipeline("fill-mask")
fill_in_the_blank = f'Some of us {completer.tokenizer.mask_token} the deep learning workshop training at SupportVectors very much!'
completer(fill_in_the_blank)

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'sequence': '<s>Some of us enjoy the deep learning workshop training at SupportVectors very much!</s>',
  'score': 0.4049758315086365,
  'token': 2254,
  'token_str': 'Ġenjoy'},
 {'sequence': '<s>Some of us appreciate the deep learning workshop training at SupportVectors very much!</s>',
  'score': 0.3120434284210205,
  'token': 5478,
  'token_str': 'Ġappreciate'},
 {'sequence': '<s>Some of us enjoyed the deep learning workshop training at SupportVectors very much!</s>',
  'score': 0.17192326486110687,
  'token': 3776,
  'token_str': 'Ġenjoyed'},
 {'sequence': '<s>Some of us love the deep learning workshop training at SupportVectors very much!</s>',
  'score': 0.05667373165488243,
  'token': 657,
  'token_str': 'Ġlove'},
 {'sequence': '<s>Some of us like the deep learning workshop training at SupportVectors very much!</s>',
  'score': 0.011890681460499763,
  'token': 101,
  'token_str': 'Ġlike'}]
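
The `token_str` values above start with `Ġ`, the marker this tokenizer uses for a token that begins with a space. A little post-processing turns the output into a plain list of candidate words. The sketch below uses the candidates from the output above with rounded scores; `clean_token` is a hypothetical helper.

```python
# Candidates copied from the fill-mask output above (scores rounded).
candidates = [
    {"token_str": "Ġenjoy", "score": 0.40498},
    {"token_str": "Ġappreciate", "score": 0.31204},
    {"token_str": "Ġenjoyed", "score": 0.17192},
    {"token_str": "Ġlove", "score": 0.05667},
    {"token_str": "Ġlike", "score": 0.01189},
]


def clean_token(token_str: str) -> str:
    """Strip the 'Ġ' marker that stands for a leading space."""
    return token_str.lstrip("Ġ")


words = [(clean_token(c["token_str"]), c["score"]) for c in candidates]
covered = sum(score for _, score in words)
print(words[0][0], round(covered, 4))
```

The five candidates together account for roughly 96% of the probability mass, so the model considers few other words plausible in this position.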

## Text Generation

Can the transformers complete your sentences? Let us give it a shot with the text:

The easiest way to start reading a paper is to first read the abstract, skim over the introduction, and short-circuit straight to the conclusions. 


In [60]:
text_generator = pipeline("text-generation")
max_length = 60

text = r'The easiest way to start reading a paper is to first read the abstract, skim over the introduction, and short-circuit straight to the conclusions.'

text_generator(text, max_length=max_length, do_sample=False)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'The easiest way to start reading a paper is to first read the abstract, skim over the introduction, and short-circuit straight to the conclusions.\n\nThe paper is a good starting point for any student who wants to learn about the history of the world. It is also a good starting point'}]

Phew! We can rest assured that AI cannot become our overlord and replace us just yet! Let us try another text:

In [62]:
text = r'There are many legendary researchers in AI. The most important researcher in AI is'
text_generator(text, max_length=max_length, do_sample=False)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'There are many legendary researchers in AI. The most important researcher in AI is the famous mathematician and philosopher, Albert Einstein. He was the first to use the term "intelligent" in his book, The Theory of Everything. He was also the first to use the term "intelligent" in his'}]
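
Both calls pass `do_sample=False`, which (assuming the default of a single beam) makes the generator pick the most probable next token at every step. This greedy strategy is deterministic, and it is one reason the second output starts repeating itself. The toy sketch below mimics the behaviour with a hand-written table of next-word probabilities; the table and `greedy_decode` are hypothetical and far simpler than a real language model, which conditions on the whole preceding text.

```python
# Hypothetical next-word probabilities, keyed by the previous word only.
NEXT_WORD = {
    "he": {"was": 0.9, "is": 0.1},
    "was": {"the": 0.6, "also": 0.4},
    "the": {"first": 0.7, "term": 0.3},
    "first": {"he": 0.8, "to": 0.2},
}


def greedy_decode(start: str, max_new_words: int) -> list:
    """Always append the most probable next word (no sampling)."""
    words = [start]
    for _ in range(max_new_words):
        options = NEXT_WORD.get(words[-1])
        if not options:
            break  # no known continuation for this word
        words.append(max(options, key=options.get))
    return words


print(" ".join(greedy_decode("he", 8)))
```

Because the most probable continuation is always chosen, the toy decoder falls into a loop as soon as it revisits a word. Sampling from the probabilities instead would break the loop at the cost of reproducibility.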

## Named Entity Recognition

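A named entity is a person, a location, an organization, and so on. In the output below, a word prefixed with ``##`` is a word piece that continues the preceding token.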

In [64]:
ner = pipeline("ner")
text = r'There are many excellent AI workshops offered by Supportvectors, which is located in Fremont, CA. Asif Qamar is one of the instructors.'
ner(text)

[{'word': 'Support',
  'score': 0.9386759400367737,
  'entity': 'I-ORG',
  'index': 9},
 {'word': '##ve', 'score': 0.9356099367141724, 'entity': 'I-ORG', 'index': 10},
 {'word': '##ctors',
  'score': 0.9391953945159912,
  'entity': 'I-ORG',
  'index': 11},
 {'word': 'Fr', 'score': 0.9946166276931763, 'entity': 'I-LOC', 'index': 17},
 {'word': '##emont',
  'score': 0.9912159442901611,
  'entity': 'I-LOC',
  'index': 18},
 {'word': 'CA', 'score': 0.9523749947547913, 'entity': 'I-LOC', 'index': 20},
 {'word': 'As', 'score': 0.9996069669723511, 'entity': 'I-PER', 'index': 22},
 {'word': '##if', 'score': 0.9994383454322815, 'entity': 'I-PER', 'index': 23},
 {'word': 'Q', 'score': 0.9995712637901306, 'entity': 'I-PER', 'index': 24},
 {'word': '##ama',
  'score': 0.9949106574058533,
  'entity': 'I-PER',
  'index': 25},
 {'word': '##r', 'score': 0.9918298721313477, 'entity': 'I-PER', 'index': 26}]
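
The raw output labels word pieces rather than whole names, so it usually needs a merging step before it is useful. The following is a minimal sketch of one way to do this: pieces starting with `##` are glued onto the previous piece, and tokens with consecutive `index` values and the same label are joined into one entity. The tokens are copied from the output above with the scores left out; `merge_entities` is a hypothetical helper, not the library's own grouping logic.

```python
# Tokens copied from the NER output above (scores omitted for brevity).
tokens = [
    {"word": "Support", "entity": "I-ORG", "index": 9},
    {"word": "##ve", "entity": "I-ORG", "index": 10},
    {"word": "##ctors", "entity": "I-ORG", "index": 11},
    {"word": "Fr", "entity": "I-LOC", "index": 17},
    {"word": "##emont", "entity": "I-LOC", "index": 18},
    {"word": "CA", "entity": "I-LOC", "index": 20},
    {"word": "As", "entity": "I-PER", "index": 22},
    {"word": "##if", "entity": "I-PER", "index": 23},
    {"word": "Q", "entity": "I-PER", "index": 24},
    {"word": "##ama", "entity": "I-PER", "index": 25},
    {"word": "##r", "entity": "I-PER", "index": 26},
]


def merge_entities(tokens):
    """Join word pieces and adjacent tokens that share an entity label."""
    merged = []  # each item is [text, entity, index of its last token]
    for tok in tokens:
        adjacent = (
            merged
            and merged[-1][1] == tok["entity"]
            and merged[-1][2] + 1 == tok["index"]
        )
        if adjacent and tok["word"].startswith("##"):
            merged[-1][0] += tok["word"][2:]    # continuation of the same word
        elif adjacent:
            merged[-1][0] += " " + tok["word"]  # next word of the same entity
        else:
            merged.append([tok["word"].lstrip("#"), tok["entity"], tok["index"]])
            continue
        merged[-1][2] = tok["index"]
    return [(text, entity) for text, entity, _ in merged]


print(merge_entities(tokens))
```

With this rule, "Fremont" and "CA" stay separate entities because the comma between them occupies index 19 and carries no entity label.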