thanks to https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_11_1_hf.ipynb

# Hugging Face

https://huggingface.co/

Hugging Face is a company founded in 2016.
It currently supports practically every aspect of working with AI.
Among other things, this includes running workloads (e.g. training models).
It also provides an extensive repository through which people in the field share the results of their work.

Pretrained models (neural networks) are an essential part of the offering.

## Installing supporting libraries

In [1]:
!pip install transformers



## Application: sentiment analysis

Sentiment analysis is a problem at the intersection of natural language processing, text analysis, computational linguistics, and related fields.
The goal is to determine the emotional tone of the speaker.
In its simplest form it is a binary decision: whether the tone is positive or negative.
A more detailed analysis also determines the degree or the category of the emotion (sadness, fear, joy, ...).

In [2]:
# https://en.wikipedia.org/wiki/Sonnet_18

text = """"
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature’s changing course, untrimmed:
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall Death brag thou wander’st in his shade
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
"""

Processing text requires preparing the individual steps and arranging them in the correct order.
The transformers library from Hugging Face provides such sequences ready-made as pipelines.
Creating a pipeline instance usually triggers an additional download of data (model weights, configuration, and vocabulary).
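
To make the idea of a ready-made sequence of steps more concrete, the sketch below chains preprocessing, a model, and postprocessing in plain Python. Everything in it is a hypothetical stand-in: the word lists, the function names, and the counting rule are invented for illustration and do not reflect how transformers implements its pipelines.

```python
# Toy illustration of a processing sequence: preprocess -> model -> postprocess.
# All names and word lists are hypothetical.

POSITIVE_WORDS = {"lovely", "fair", "eternal"}
NEGATIVE_WORDS = {"rough", "death", "fade"}


def preprocess(text):
    """Lower-case the text and split it into words, dropping punctuation."""
    words = []
    for raw in text.lower().split():
        word = raw.strip(".,:;?!\"'")
        if word:
            words.append(word)
    return words


def toy_model(words):
    """Return raw counts of negative and positive words (stand-in for a network)."""
    negative = sum(word in NEGATIVE_WORDS for word in words)
    positive = sum(word in POSITIVE_WORDS for word in words)
    return [negative, positive]


def postprocess(counts):
    """Turn raw counts into a label, mimicking the shape of a classifier result."""
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(counts)), key=lambda i: counts[i])
    return {"label": labels[best], "counts": counts}


def toy_pipeline(text):
    """Run the three steps in the required order."""
    return postprocess(toy_model(preprocess(text)))


result = toy_pipeline("Thou art more lovely and more temperate")
print(result)
```

The point of the sketch is only the ordering: each step consumes the output of the previous one, and the caller sees a single function.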

In [3]:
import pandas as pd
from transformers import pipeline

classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Once the processing sequence is ready, using it is relatively trivial.

In [4]:
outputs = classifier(text)
pd.DataFrame(outputs)

| | label | score |
|---|---|---|
| 0 | POSITIVE | 0.97725 |


The sentiment analysis concludes that the processed text is positive in tone (score 0.97725).
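
A score of this kind can be read as a probability. As a minimal sketch, assuming the classifier produces one raw score (logit) per class and normalises them with a softmax, the label and score could be derived as follows. The logit values are invented for illustration and are not taken from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the classes [NEGATIVE, POSITIVE].
labels = ["NEGATIVE", "POSITIVE"]
logits = [-1.9, 1.86]

probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])
prediction = {"label": labels[best], "score": round(probs[best], 5)}
print(prediction)
```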

## Determining entity categories (entity tagging)

In [5]:
import pandas as pd
from transformers import pipeline

text2 = "Abraham Lincoln was a president who lived in the United States."

tagger = pipeline("ner", aggregation_strategy="simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [6]:
outputs = tagger(text2)
pd.DataFrame(outputs)

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.998893 | Abraham Lincoln | 0 | 15 |
| 1 | LOC | 0.999651 | United States | 49 | 62 |


In this case the analysis identifies a person (Abraham Lincoln) and a location (United States).

A slightly modified statement with made-up names and its analysis are shown below.
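
The `start` and `end` columns are character offsets into the input string, with `end` exclusive. The short sketch below checks this by slicing the sentence with the values from the output above; it uses plain Python only and does not call the model.

```python
text2 = "Abraham Lincoln was a president who lived in the United States."

# Entities copied from the output above.
entities = [
    {"entity_group": "PER", "word": "Abraham Lincoln", "start": 0, "end": 15},
    {"entity_group": "LOC", "word": "United States", "start": 49, "end": 62},
]

for entity in entities:
    span = text2[entity["start"]:entity["end"]]
    print(f'{entity["entity_group"]}: {span!r}')

# Collect the sliced spans so they can be compared with the reported words.
spans = [text2[e["start"]:e["end"]] for e in entities]
```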

In [9]:
text2a = "Honza Swejk is torturing turists at Novy Kakac."
outputs = tagger(text2a)
pd.DataFrame(outputs)

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.97277 | Honza Swejk | 0 | 11 |
| 1 | ORG | 0.891609 | Novy Kakac | 36 | 46 |
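
The tagger was created with `aggregation_strategy="simple"`, which is why whole names appear in the output rather than individual tokens. The sketch below illustrates the general idea of such grouping under simplifying assumptions: tokens carry `B-`/`I-` prefixed tags, adjacent tokens of the same type are merged, and the group score is the mean of the token scores. The per-token tags and scores are invented, and the function is a hypothetical illustration, not the library's implementation.

```python
def group_entities(tokens):
    """Merge adjacent tokens that carry the same entity type into one group.

    Each token is a dict with 'word', 'tag' (e.g. 'B-PER', 'I-PER', 'O') and 'score'.
    """
    groups = []
    current = None
    for token in tokens:
        tag = token["tag"]
        if tag == "O":
            current = None  # outside any entity: close the open group
            continue
        prefix, entity_type = tag.split("-", 1)
        starts_new = (
            current is None
            or prefix == "B"
            or current["entity_group"] != entity_type
        )
        if starts_new:
            current = {"entity_group": entity_type, "words": [], "scores": []}
            groups.append(current)
        current["words"].append(token["word"])
        current["scores"].append(token["score"])
    return [
        {
            "entity_group": g["entity_group"],
            "word": " ".join(g["words"]),
            "score": sum(g["scores"]) / len(g["scores"]),
        }
        for g in groups
    ]


# Hypothetical per-token predictions for the made-up sentence.
tokens = [
    {"word": "Honza", "tag": "B-PER", "score": 0.98},
    {"word": "Swejk", "tag": "I-PER", "score": 0.96},
    {"word": "is", "tag": "O", "score": 0.99},
    {"word": "Novy", "tag": "B-ORG", "score": 0.90},
    {"word": "Kakac", "tag": "I-ORG", "score": 0.88},
]

grouped = group_entities(tokens)
for entity in grouped:
    print(entity)
```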


## Extracting information from text (question answering)

In [10]:
import pandas as pd
from transformers import pipeline

reader = pipeline("question-answering")
question = "What now shall fade?"

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [11]:
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.5032 | 362 | 376 | eternal summer |
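
The answer is an extract from the context rather than newly written text: `start` and `end` are character offsets into the string passed as `context`. The sketch below reproduces the first nine lines of the sonnet exactly as defined earlier (including the stray quotation mark after the opening triple quote, which shifts every offset by one) and slices out the answer. It does not run the model; the result dictionary is copied from the output above.

```python
# First nine lines of the sonnet, reproduced as in the earlier cell,
# including the extra quotation mark that follows the opening triple quote.
context = """"
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature’s changing course, untrimmed:
But thy eternal summer shall not fade,
"""

# Result copied from the question-answering output above.
answer = {"score": 0.5032, "start": 362, "end": 376, "answer": "eternal summer"}

# Slicing the context with the reported offsets recovers the answer text.
extracted = context[answer["start"]:answer["end"]]
print(extracted)
```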


## Translation

In [13]:
!pip install transformers[sentencepiece]



In [2]:
import pandas as pd
from transformers import pipeline

translator = pipeline("translation_en_to_de",
    model="Helsinki-NLP/opus-mt-en-de")




In [3]:
text = """"
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature’s changing course, untrimmed:
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall Death brag thou wander’st in his shade
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
"""

outputs = translator(text, clean_up_tokenization_spaces=True,
    min_length=100)
print(outputs[0]['translation_text'])

„ Soll ich dich mit einem Sommertag vergleichen? Du bist schöner und gemäßigter: Raue Winde schütteln die lieblichen Knospen des Mais, Und der Sommerpacht hat zu kurz ein Datum: Irgendwann zu heiß das Auge des Himmels scheint, Und oft ist sein Gold Teint getrübt; Und jede Faire von Fair irgendwann sinkt, Durch Zufall, oder die Natur changieren Kurs, untrimmed: Aber dein ewiger Sommer wird nicht verblassen, noch verlieren Besitz von dem Schönen du ow.; Noch wird der Tod prahlen du wandert in seinem Schatten Wenn in ewigen Linien zur Zeit wachsen: So lange Männer atmen oder Augen sehen können, So lange lebt dies, und dies gibt dir Leben.


## Translation II

In [4]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="pt"
)

outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>
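
The `<pad>` and `</s>` markers in the printed translation are special tokens, not part of the sentence. The tokenizer's `decode` method accepts `skip_special_tokens=True` to leave them out. The toy function below imitates that behaviour on a made-up token table; the IDs, the vocabulary, and the function name are hypothetical.

```python
# Hypothetical vocabulary: IDs and tokens are invented for this illustration.
ID_TO_TOKEN = {0: "<pad>", 1: "</s>", 10: "Hugging", 11: "Face", 12: "ist", 13: "ein"}
SPECIAL_TOKENS = {"<pad>", "</s>"}


def toy_decode(ids, skip_special_tokens=False):
    """Map IDs back to tokens and join them, optionally dropping special tokens."""
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens)


generated = [0, 10, 11, 12, 13, 1]
print(toy_decode(generated))
print(toy_decode(generated, skip_special_tokens=True))
```

In the cell above, the corresponding call would be `tokenizer.decode(outputs[0], skip_special_tokens=True)`.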


In [5]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
    "translate English to French: Hugging Face is a technology company based in New York and Paris",
    return_tensors="pt"
)

outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))

<pad> Hugging Face est une entreprise technologique basée à New York et à Paris.</s>


## Summarization

In [7]:
from transformers import pipeline

# https://en.wikipedia.org/wiki/Apple

text2 = """
An apple is a round, edible fruit produced by an apple tree 
(Malus spp., among them the domestic or orchard apple; Malus domestica). 
Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus. 
The tree originated in Central Asia, where its wild ancestor, Malus sieversii, is still found. 
Apples have been grown for thousands of years in Asia and Europe and were introduced to 
North America by European colonists. Apples have religious and mythological significance in many cultures, 
including Norse, Greek, and European Christian tradition. 
"""

summarizer = pipeline("summarization")

outputs = summarizer(text2, max_length=45,
    clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Your min_length=56 must be inferior than your max_length=45.


 An apple is an edible fruit produced by an apple tree (Malus domestica) Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus. Apples have religious and mythological
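
The warning above arises because only `max_length=45` was passed, while `min_length` kept the value 56 reported in the message. One way to avoid the conflict is to pass both values and make sure the minimum stays below the maximum. The helper below is a hypothetical sketch of such a check; halving the maximum is an arbitrary choice made for illustration.

```python
def check_lengths(min_length, max_length):
    """Return a (min_length, max_length) pair in which min_length < max_length.

    Hypothetical helper: if the minimum is not below the maximum,
    it is lowered to half of the maximum.
    """
    if max_length < 2:
        raise ValueError("max_length must be at least 2")
    if min_length >= max_length:
        min_length = max_length // 2
    return min_length, max_length


# Values taken from the warning above: min_length=56, requested max_length=45.
min_length, max_length = check_lengths(56, 45)
print(min_length, max_length)
```

The resulting pair could then be passed on as `summarizer(text2, min_length=min_length, max_length=max_length)`.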


## Text generation

In [8]:
from transformers import pipeline

generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [11]:
# https://en.wikipedia.org/wiki/Sonnet_18

text = """"
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature’s changing course, untrimmed:
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall Death brag thou wander’st in his shade
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
"""

outputs = generator(text, max_length=400)
print(outputs[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature’s changing course, untrimmed:
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall Death brag thou wander’st in his shade
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
But the young woman is faire and warm,
And as sweet and tender as a summer's day;
When winter turns to her new time,
Which may be too warm for thee, and winter to thee.’[24]
This is what he says in some of our later texts:
"With my daughter who is like to walk in the world,
Be as warm as the young lady's day, and not too hot,
The old woman may walk in the summer-time, her s

In [12]:
text = """"
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature’s changing course, untrimmed:
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall Death brag thou wander’st in his shade
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
"""

outputs = generator(text, max_length=400)
print(outputs[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature’s changing course, untrimmed:
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall Death brag thou wander’st in his shade
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
So long may it abide to thy side: thy time's fall, thou willst stay in death;
If thy head is a hundred and forty,
From thy last days shalt thou turn thy lips in heaven:
Till thy mind's face be dark, yet thou mayst go to the city's gates,
And make thy peace at its gates, to God’s peace.
All thy life thy life thy time that thou mayst live:
In thy face be ever and live: in thy
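
The two runs above used the same prompt yet produced different continuations. That is consistent with the next token being drawn at random from a probability distribution instead of always taking the most likely candidate. The sketch below shows the difference on an invented four-token distribution; the tokens and probabilities are hypothetical. With a fixed seed the random draw becomes repeatable (transformers offers `set_seed` for the same purpose).

```python
import random

# Hypothetical next-token distribution; tokens and probabilities are invented.
candidates = ["So", "But", "And", "When"]
probabilities = [0.4, 0.3, 0.2, 0.1]


def sample_tokens(seed, count=5):
    """Draw `count` tokens at random according to `probabilities`."""
    rng = random.Random(seed)  # a private generator keeps runs independent
    return rng.choices(candidates, weights=probabilities, k=count)


def greedy_token():
    """Always pick the single most probable token (no randomness)."""
    best = max(range(len(candidates)), key=lambda i: probabilities[i])
    return candidates[best]


print(sample_tokens(seed=1))
print(sample_tokens(seed=2))  # a different seed may give a different sequence
print(sample_tokens(seed=1))  # same seed, same sequence as the first call
print(greedy_token())
```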

In [13]:
# https://en.wikipedia.org/wiki/Apple

text2 = """
An apple is a round, edible fruit produced by an apple tree 
(Malus spp., among them the domestic or orchard apple; Malus domestica). 
Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus. 
The tree originated in Central Asia, where its wild ancestor, Malus sieversii, is still found. 
Apples have been grown for thousands of years in Asia and Europe and were introduced to 
North America by European colonists. Apples have religious and mythological significance in many cultures, 
including Norse, Greek, and European Christian tradition. 
"""

outputs = generator(text2, max_length=400)
print(outputs[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



An apple is a round, edible fruit produced by an apple tree 
(Malus spp., among them the domestic or orchard apple; Malus domestica). 
Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus. 
The tree originated in Central Asia, where its wild ancestor, Malus sieversii, is still found. 
Apples have been grown for thousands of years in Asia and Europe and were introduced to 
North America by European colonists. Apples have religious and mythological significance in many cultures, 
including Norse, Greek, and European Christian tradition. 
The most powerful and sought after cultivars are usually the orchards. 
Apples are made to resemble fruits from a tree, 
appled with a hard, brown, and pinky white peppercorn. 
Apples are a tropical fruit plant. Unlike apple fruits, which are very tough,
appled with a soft purple apple peel, and fruitier with a purple-crispy apple. 
The apple also resembles pearls and the pear is often believed to represent water

In [15]:
text2 = """
Jablko je kulaté, poživatelné ovoce, které roste na stromech. Tyto stromy jsou pěstovány napříč celým světem.
Místo původu stromu je ve Střední Asii, kde lze stále nalézt původní divoký druh.
Jablko má nabáženský a mytologický význam v mnoha kulturách včetně norské, řecké a také v Evropské křesťanské tradici.
"""

outputs = generator(text2, max_length=400)
print(outputs[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Jablko je kulaté, poživatelné ovoce, které roste na stromech. Tyto stromy jsou pěstovány napříč celým světem.
Místo původu stromu je ve Střední Asii, kde lze stále nalézt původní divoký druh.
Jablko má nabáženský a mytologický význam v mnoha kulturách včetně norské, řecké a také v Evropské křesťanské tradici.
Místo hénydim původní stromu je tikne stim.
Má dumé prót, kérożi, le těrt, vvíne zdavrán, půżu jabakí, půz dédí prót v pzíracz, kjemkí zlók, kjemzí jedrz v pzibi
- A místo hígydí nag dov kostanje.
Místo původní kulatie, tórna ztár, o vér nyvodlém przoy.
Jablko takim půz mżej půva kajh.
Jablko vod ládá dzát, všo


## Tokenization

In [16]:
from transformers import AutoTokenizer
model = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model)

In [17]:
encoded = tokenizer('Tokenizing text is easy.')
print(encoded)

{'input_ids': [101, 19204, 6026, 3793, 2003, 3733, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}


In [18]:
tokenizer.convert_ids_to_tokens(encoded.input_ids)

['[CLS]', 'token', '##izing', 'text', 'is', 'easy', '.', '[SEP]']
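
The `##` prefix marks a piece that continues the previous one, so `token` + `##izing` stands for the single word "tokenizing". A common way to produce such pieces is greedy longest-match-first lookup in a fixed vocabulary. The sketch below shows that idea with a tiny, hypothetical vocabulary; it leaves out lower-casing, punctuation splitting, and the `[CLS]`/`[SEP]` markers, and it is not the library's implementation.

```python
def wordpiece(word, vocab, unknown="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are marked with ##
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unknown]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Tiny hypothetical vocabulary; a real one is far larger.
vocab = {"token", "##izing", "##ize", "text", "is", "easy", "."}

tokens = []
for word in ["tokenizing", "text", "is", "easy", "."]:
    tokens.extend(wordpiece(word, vocab))
print(tokens)
```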

In [19]:
text = [
    "This movie was great!",
    "I hated this move, waste of time!",
    "Epic?"
]

encoded = tokenizer(text, padding=True, add_special_tokens=True)

print("**Input IDs**")
for a in encoded.input_ids:
    print(a)

print("**Attention Mask**")
for a in encoded.attention_mask:
    print(a)

**Input IDs**
[101, 2023, 3185, 2001, 2307, 999, 102, 0, 0, 0, 0]
[101, 1045, 6283, 2023, 2693, 1010, 5949, 1997, 2051, 999, 102]
[101, 8680, 1029, 102, 0, 0, 0, 0, 0, 0, 0]
**Attention Mask**
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
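
With `padding=True` the shorter sequences are filled with zeros up to the length of the longest one, and the attention mask marks real tokens with 1 and padding with 0. The sketch below rebuilds both outputs from the unpadded IDs in plain Python. The function name is hypothetical, and the pad ID of 0 is read off the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = []
    attention_mask = []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


# Unpadded token IDs taken from the output above.
sequences = [
    [101, 2023, 3185, 2001, 2307, 999, 102],
    [101, 1045, 6283, 2023, 2693, 1010, 5949, 1997, 2051, 999, 102],
    [101, 8680, 1029, 102],
]

input_ids, attention_mask = pad_batch(sequences)
for row in input_ids:
    print(row)
for row in attention_mask:
    print(row)
```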


# Datasets

In [20]:
!pip install datasets

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [21]:
from datasets import list_datasets

# Note: later releases of datasets deprecate list_datasets in favour of
# huggingface_hub.list_datasets; the call is kept here as originally run.
all_datasets = list_datasets()

print(f"Hugging Face hub currently contains {len(all_datasets)}")
print("datasets. The first 10 are:")
print("\n".join(all_datasets[:10]))


Hugging Face hub currently contains 91128
datasets. The first 10 are:
acronym_identification
ade_corpus_v2
adversarial_qa
aeslc
afrikaans_ner_corpus
ag_news
ai2_arc
air_dialogue
ajgt_twitter_ar
allegro_reviews


In [22]:
from datasets import load_dataset

emotions = load_dataset("emotion")
emotions

Downloading builder script:   0%|          | 0.00/3.97k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/3.28k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.78k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/592k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/74.0k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/74.9k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/16000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})
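
The printed `DatasetDict` holds three splits. As a quick sanity check, the share of each split can be computed from the row counts shown above; the snippet uses only those numbers and does not load the dataset.

```python
# Row counts copied from the DatasetDict printed above.
split_sizes = {"train": 16000, "validation": 2000, "test": 2000}

total = sum(split_sizes.values())
shares = {name: rows / total for name, rows in split_sizes.items()}

for name, share in shares.items():
    print(f"{name:<10} {split_sizes[name]:>6} rows  {share:.0%}")
```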

In [23]:
import pandas as pd

emotions.set_format(type='pandas')
df = emotions["train"][:]
df[:5]

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |
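
The `label` column holds integer class IDs. The sketch below maps them to names for the five rows shown above. The list of class names is an assumption based on the dataset card, not something printed in this notebook; in the datasets library the same lookup should be available via `emotions["train"].features["label"].int2str`.

```python
# Class names assumed from the dataset card, in label-ID order.
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# The five rows shown above: (text, label ID).
rows = [
    ("i didnt feel humiliated", 0),
    ("i can go from feeling so hopeless to so damned...", 0),
    ("im grabbing a minute to post i feel greedy wrong", 3),
    ("i am ever feeling nostalgic about the fireplac...", 2),
    ("i am feeling grouchy", 3),
]

# Replace each integer ID with its assumed class name.
named = [(text, LABEL_NAMES[label]) for text, label in rows]
for text, name in named:
    print(f"{name:<8} {text}")
```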
