Check code and tasks here: https://huggingface.co/docs/transformers/v4.15.0/en/task_summary

Check models here: https://huggingface.co/models

Check pipelines here: https://huggingface.co/docs/transformers/main_classes/pipelines

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transforme

**Sentiment Analysis**

In [2]:
from transformers import pipeline
# Use the default model that Hugging Face picks for this task
classifier = pipeline("sentiment-analysis")

result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

result = classifier("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


label: NEGATIVE, with score: 0.9991
label: POSITIVE, with score: 0.9999
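
The `score` above is a probability, not a raw model output. A minimal sketch of how such a score can come about, assuming the pipeline applies a softmax to the model's two output logits. The logits below are made up for illustration and are not taken from the real model:

```python
import math

def softmax(logits):
    """Turn raw class logits into probabilities that sum to 1."""
    # Subtract the largest logit first so exp() cannot overflow.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (NEGATIVE, POSITIVE); an assumption, not real model output.
labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.0, 4.5]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])

# Same dict shape as one item returned by the sentiment pipeline.
result = {"label": labels[best], "score": probs[best]}
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
```

The bigger the gap between the two logits, the closer the winning score gets to 1.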


**Named Entity Recognition**


In [3]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
# Select the model from the Hugging Face model hub
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
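
Each result carries `start` and `end` character offsets into the input string, so the entity text can be sliced straight out of it. A small sketch using the output printed above; `entities_by_type` is a hypothetical helper, and it does not merge multi-token entities (every `B-`/`I-` item becomes its own span):

```python
# Output copied from the cell above.
ner_results = [
    {'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19},
    {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40},
]
example = "My name is Wolfgang and I live in Berlin"

def entities_by_type(text, results):
    """Group entity strings by type, slicing them out of the original text."""
    grouped = {}
    for item in results:
        # Drop the B-/I- prefix of the BIO tagging scheme: 'B-PER' -> 'PER'.
        entity_type = item['entity'].split('-', 1)[-1]
        # Use the character offsets instead of the 'word' field.
        span = text[item['start']:item['end']]
        grouped.setdefault(entity_type, []).append(span)
    return grouped

print(entities_by_type(example, ner_results))
# {'PER': ['Wolfgang'], 'LOC': ['Berlin']}
```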


**Question Answering** (extractive), coded by hand instead of using a pipeline

In [4]:
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('salti/bert-base-multilingual-cased-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('salti/bert-base-multilingual-cased-finetuned-squad')
question = "How many parameters does BERT-large have?"
answer_text = "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."

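# Encode question and context as one sequence: [CLS] question [SEP] context [SEP]
input_ids = tokenizer.encode(question, answer_text)
tokens = tokenizer.convert_ids_to_tokens(input_ids)

# Segment A (the question) runs up to and including the first [SEP]; segment B is the rest
sep_index = input_ids.index(tokenizer.sep_token_id)
num_seg_a = sep_index + 1
num_seg_b = len(input_ids) - num_seg_a
segment_ids = [0]*num_seg_a + [1]*num_seg_b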

outputs = model(
    torch.tensor([input_ids]),  # The tokens representing our input text.
    token_type_ids=torch.tensor([segment_ids]),  # The segment IDs to differentiate question from answer_text
    return_dict=True,
)

start_scores = outputs.start_logits
end_scores = outputs.end_logits

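# Most likely start and end token positions, converted to plain Python ints
answer_start = torch.argmax(start_scores).item()
answer_end = torch.argmax(end_scores).item()

# Rebuild the answer string, gluing '##' WordPiece continuations onto the previous token
answer = tokens[answer_start]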
for i in range(answer_start + 1, answer_end + 1):
    if tokens[i][0:2] == '##':
        answer += tokens[i][2:]
    else:
        answer += ' ' + tokens[i]
print('Answer: "' + answer + '"')

Answer: "340M"


**Fill the mask** with IXAmBERT, a multilingual bidirectional language model

In [7]:
from transformers import pipeline
unmaskerEU = pipeline("fill-mask", model="ixa-ehu/ixambert-base-cased")  # multilingual model (en/es/eu), trained by me :D
unmaskerEU("Nire aitak amari gona gorria ekarri [MASK].", top_k=5)

Some weights of the model checkpoint at ixa-ehu/ixambert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.7118955254554749,
  'sequence': 'Nire aitak amari gona gorria ekarri zion.',
  'token': 5094,
  'token_str': 'zion'},
 {'score': 0.15650056302547455,
  'sequence': 'Nire aitak amari gona gorria ekarri dio.',
  'token': 1210,
  'token_str': 'dio'},
 {'score': 0.050741858780384064,
  'sequence': 'Nire aitak amari gona gorria ekarri zidan.',
  'token': 20444,
  'token_str': 'zidan'},
 {'score': 0.014231753535568714,
  'sequence': 'Nire aitak amari gona gorria ekarri dit.',
  'token': 3625,
  'token_str': 'dit'},
 {'score': 0.011380593292415142,
  'sequence': 'Nire aitak amari gona gorria ekarri zigun.',
  'token': 29717,
  'token_str': 'zigun'}]

In [8]:
unmaskerEU("My cat was really [MASK].", top_k=5)

[{'score': 0.032386429607868195,
  'sequence': 'My cat was really good.',
  'token': 2908,
  'token_str': 'good'},
 {'score': 0.02462773770093918,
  'sequence': 'My cat was really nice.',
  'token': 36449,
  'token_str': 'nice'},
 {'score': 0.019835611805319786,
  'sequence': 'My cat was really beautiful.',
  'token': 13840,
  'token_str': 'beautiful'},
 {'score': 0.01757289096713066,
  'sequence': 'My cat was really crazy.',
  'token': 47683,
  'token_str': 'crazy'},
 {'score': 0.015928538516163826,
  'sequence': 'My cat was really wonderful.',
  'token': 40493,
  'token_str': 'wonderful'}]
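
One way to read the two outputs side by side is to add up the scores of the top 5 candidates: the closer the sum is to 1, the more certain the model is about what fills the mask. A small sketch with the scores copied from the two cells above; `top_k_mass` is a hypothetical helper:

```python
# Scores and tokens copied from the two fill-mask outputs above.
basque = [
    {'score': 0.7118955254554749, 'token_str': 'zion'},
    {'score': 0.15650056302547455, 'token_str': 'dio'},
    {'score': 0.050741858780384064, 'token_str': 'zidan'},
    {'score': 0.014231753535568714, 'token_str': 'dit'},
    {'score': 0.011380593292415142, 'token_str': 'zigun'},
]
english = [
    {'score': 0.032386429607868195, 'token_str': 'good'},
    {'score': 0.02462773770093918, 'token_str': 'nice'},
    {'score': 0.019835611805319786, 'token_str': 'beautiful'},
    {'score': 0.01757289096713066, 'token_str': 'crazy'},
    {'score': 0.015928538516163826, 'token_str': 'wonderful'},
]

def top_k_mass(predictions):
    """Total probability covered by the returned candidates."""
    return sum(p['score'] for p in predictions)

for name, preds in [("Basque", basque), ("English", english)]:
    top = max(preds, key=lambda p: p['score'])
    print(f"{name}: best='{top['token_str']}', top-5 mass={top_k_mass(preds):.3f}")
```

The Basque sentence gets roughly 0.94 of the probability from its top 5 candidates, since the grammar narrows the mask down to a few auxiliary verb forms. The English one gets only about 0.11, because many adjectives fit the slot.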