# Basic Tasks with Language Models

In this exercise we will solve several typical natural language processing (NLP) problems:

> Sentiment analysis

> Question answering

> Named entity recognition

> Text summarization

> Text generation

> Text translation





Before we start solving these problems, we will briefly explain the concept of [pipelines in Hugging Face Transformers](https://huggingface.co/transformers/main_classes/pipelines.html).

Pipelines are objects that abstract away the more complex code inside the library. They offer a simpler way to access the models provided by Hugging Face Transformers, which we can use to solve a variety of problems.


First, we need to install the `transformers` package so that we can use Hugging Face Transformers.

In [None]:
! pip install transformers

In [3]:
from transformers import pipeline

## Sentiment analysis

To determine the sentiment of a text, we create a pipeline and tell it that the task is "sentiment-analysis".

The result classifies the text as positive or negative and includes a score, which is the probability the model assigns to that label.

Because we did not specify a model, the default one is used, in this case distilbert-base-uncased-finetuned-sst-2-english.

In [4]:
nlp = pipeline("sentiment-analysis")

print(nlp("I hate you"))
print(nlp("I love you"))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]
[{'label': 'POSITIVE', 'score': 0.9998656511306763}]
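The log message above shows that the pipeline silently fell back to a default checkpoint. The following is a minimal sketch of how that choice could be made explicit, assuming the `model` argument of `pipeline` is used with the checkpoint name taken from the log. The helper name `build_sentiment_pipeline` is hypothetical and only serves the illustration.

```python
# Checkpoint name copied from the log message above.
SENTIMENT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"


def build_sentiment_pipeline(model_name=SENTIMENT_MODEL):
    """Create a sentiment pipeline with an explicitly named model."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_name)
```

Calling `build_sentiment_pipeline()` would then give an object that can be used in the same way as `nlp` above, without depending on whichever default the installed library version happens to ship.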


We can try a variety of example sentences.

In [5]:
print(nlp("I am in love with mathematics"))

[{'label': 'POSITIVE', 'score': 0.999725878238678}]


In [6]:
print(nlp("Today is a bad day"))

[{'label': 'NEGATIVE', 'score': 0.999760091304779}]


In [7]:
print(nlp("The sun is shining"))

[{'label': 'POSITIVE', 'score': 0.9998611211776733}]


In [8]:
print(nlp("The presentation was not great"))

[{'label': 'NEGATIVE', 'score': 0.9997530579566956}]
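Each result is a dictionary with a `label` and a `score`, and the model only distinguishes two labels, so even a fairly neutral sentence such as "The sun is shining" is assigned one of them. As a small sketch of how these results might be post-processed, the hypothetical helper `signed_score` below folds label and score into a single number between -1 and 1. The input values are copied from the first cell of this section.

```python
def signed_score(prediction):
    """Map a {'label', 'score'} dict to a value in [-1, 1]."""
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Outputs copied from the first sentiment cell above.
results = [
    {"label": "NEGATIVE", "score": 0.9991129040718079},
    {"label": "POSITIVE", "score": 0.9998656511306763},
]

for result in results:
    print(f"{result['label']:>8}: {signed_score(result):+.4f}")
```

A single signed value like this is convenient when sentiments of many texts need to be sorted or averaged.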


## Question answering



Extractive question answering means answering questions whose answer is contained in a context passage that is supplied together with the question.

The result contains the answer extracted from the text, a confidence score, and the "start" and "end" values that give the character position of the answer within the context.

For this kind of problem, we tell the pipeline that the task is "question-answering".

In [9]:
nlp = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the `run_squad.py`.
"""

print(nlp(question="What is extractive question answering?", context=context))
print(nlp(question="What is a good example of a question answering dataset?", context=context))

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'score': 0.622244656085968, 'start': 34, 'end': 95, 'answer': 'the task of extracting an answer from a text given a question'}
{'score': 0.5115318894386292, 'start': 147, 'end': 160, 'answer': 'SQuAD dataset'}


We can try one more example.

In [10]:
context = r"""
Today is a very sunny day. The sun has been shining from 8 am until now. I am very happy because of that.
"""

print(nlp(question="Is it sunny today?", context=context))

{'score': 0.23926770687103271, 'start': 1, 'end': 26, 'answer': 'Today is a very sunny day'}
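The "start" and "end" values can be checked directly against the context. The sketch below reuses the context and the result from the cell above and assumes, as the numbers suggest, that "start" is inclusive and "end" is exclusive, so that an ordinary Python slice recovers the answer.

```python
context = r"""
Today is a very sunny day. The sun has been shining from 8 am until now. I am very happy because of that.
"""

# Result copied from the cell above.
result = {"score": 0.23926770687103271, "start": 1, "end": 26,
          "answer": "Today is a very sunny day"}

# Slice the context with the reported character offsets.
span = context[result["start"]:result["end"]]
print(span)
print(span == result["answer"])
```

Note that the offset 1 accounts for the newline at the very beginning of the triple-quoted string. The comparatively low score here may reflect that a yes/no question does not fit an extractive model well, since it can only return a span of the context.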


## Named entity recognition

Named entity recognition (NER) is the task of assigning each token to a class, for example identifying a token as a person, an organisation, and so on.

In the following example we perform named entity recognition, where each token can belong to one of the following nine categories:

> O: Outside of a named entity

> B-MISC: Beginning of a miscellaneous entity right after another miscellaneous entity (that is, a separate entity that directly follows one of the same kind)

> I-MISC: Miscellaneous entity

> B-PER: Beginning of a person's name right after another person's name

> I-PER: Person's name

> B-ORG: Beginning of an organisation right after another organisation

> I-ORG: Organisation

> B-LOC: Beginning of a location right after another location

> I-LOC: Location

In [11]:
nlp = pipeline("ner")

# Note: the two literals below are joined without a space, so the model sees "veryclose".
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge which is visible from the window."

print(nlp(sequence))

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'entity': 'I-ORG', 'score': 0.9995633, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.99159384, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}, {'entity': 'I-ORG', 'score': 0.9982672, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}, {'entity': 'I-ORG', 'score': 0.9994404, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}, {'entity': 'I-LOC', 'score': 0.99943465, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}, {'entity': 'I-LOC', 'score': 0.99932706, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}, {'entity': 'I-LOC', 'score': 0.9993865, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}, {'entity': 'I-LOC', 'score': 0.9825621, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}, {'entity': 'I-LOC', 'score': 0.93698275, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}, {'entity': 'I-LOC', 'score': 0.8987098, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}, {'entity': 'I-LOC', 'score': 0.97582406, 'index': 29, 'word': 'Manha

In [13]:
sequence = "FCSE is a faculty that is based in Skopje, North Macedonia, established in 2011."

print(nlp(sequence))

[{'entity': 'I-ORG', 'score': 0.9978166, 'index': 1, 'word': 'FC', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.989217, 'index': 2, 'word': '##SE', 'start': 2, 'end': 4}, {'entity': 'I-LOC', 'score': 0.9992031, 'index': 10, 'word': 'S', 'start': 35, 'end': 36}, {'entity': 'I-LOC', 'score': 0.99813515, 'index': 11, 'word': '##ko', 'start': 36, 'end': 38}, {'entity': 'I-LOC', 'score': 0.9821137, 'index': 12, 'word': '##p', 'start': 38, 'end': 39}, {'entity': 'I-LOC', 'score': 0.9982169, 'index': 13, 'word': '##je', 'start': 39, 'end': 41}, {'entity': 'I-LOC', 'score': 0.9962871, 'index': 15, 'word': 'North', 'start': 43, 'end': 48}, {'entity': 'I-LOC', 'score': 0.9984356, 'index': 16, 'word': 'Macedonia', 'start': 49, 'end': 58}]
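The output above lists word pieces such as `S`, `##ko`, `##p` and `##je` separately, which is hard to read. The following is a minimal sketch of how they could be merged back into whole entities, using the hypothetical helper `group_entities`. It assumes that tokens with consecutive `index` values and the same entity type belong together, and it uses the `start` and `end` offsets to slice the original sentence. The token list is copied from the output above with the scores left out.

```python
def group_entities(tokens, text):
    """Merge adjacent tokens that share an entity type into one span."""
    groups = []
    for token in tokens:
        entity_type = token["entity"].split("-")[-1]  # 'I-ORG' -> 'ORG'
        last = groups[-1] if groups else None
        if (last is not None and last["type"] == entity_type
                and token["index"] == last["last_index"] + 1):
            # Directly following token of the same type: extend the span.
            last["end"] = token["end"]
            last["last_index"] = token["index"]
        else:
            groups.append({"type": entity_type, "start": token["start"],
                           "end": token["end"], "last_index": token["index"]})
    return [(text[g["start"]:g["end"]], g["type"]) for g in groups]


sequence = "FCSE is a faculty that is based in Skopje, North Macedonia, established in 2011."

# Tokens copied from the output above (scores omitted for brevity).
tokens = [
    {"entity": "I-ORG", "index": 1, "word": "FC", "start": 0, "end": 2},
    {"entity": "I-ORG", "index": 2, "word": "##SE", "start": 2, "end": 4},
    {"entity": "I-LOC", "index": 10, "word": "S", "start": 35, "end": 36},
    {"entity": "I-LOC", "index": 11, "word": "##ko", "start": 36, "end": 38},
    {"entity": "I-LOC", "index": 12, "word": "##p", "start": 38, "end": 39},
    {"entity": "I-LOC", "index": 13, "word": "##je", "start": 39, "end": 41},
    {"entity": "I-LOC", "index": 15, "word": "North", "start": 43, "end": 48},
    {"entity": "I-LOC", "index": 16, "word": "Macedonia", "start": 49, "end": 58},
]

print(group_entities(tokens, sequence))
# [('FCSE', 'ORG'), ('Skopje', 'LOC'), ('North Macedonia', 'LOC')]
```

This simple rule would wrongly merge two different entities of the same type that stand directly next to each other, which is the case the B- labels exist for. Recent releases of the library also appear to offer built-in grouping through an `aggregation_strategy` argument on the NER pipeline, depending on the installed version.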


## Text summarization

Summarization is the task of condensing a text into a shorter one.
Here the default model is used, which is a distilled BART model (sshleifer/distilbart-cnn-12-6).

In [14]:
summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

print(summarizer(ARTICLE, max_length=130, min_length=30))

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'}]


## Text generation


Text generation is the task of producing coherent text that continues a given prompt and its context.

We select the corresponding task for the pipeline and also set the maximum length of the generated text, measured in tokens with the prompt included.



In [15]:
text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50))

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'As far as I am concerned, I will have more than enough gold to buy a few more, as well as gold-enhanced shoes. I hope I will find at least one of these and it will cost me time and money anyway so you'}]


We can run the same prompt again and see different versions of the generated text.

In [16]:
print(text_generator("As far as I am concerned, I will", max_length=50))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'As far as I am concerned, I will continue to support him. And I\'ll watch as he continues to get elected as governor."'}]


In [17]:
print(text_generator("As far as I am concerned, I will", max_length=50))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'As far as I am concerned, I will get a deal.'}]


In [18]:
print(text_generator("As far as I am concerned, I will", max_length=50))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "As far as I am concerned, I will give you plenty of time until the start of the tournament. Please let me know so I don't have to wait too long before doing any more tests. Thanks for listening to our team, we hope you"}]
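The same prompt yields a different continuation on every run, presumably because the pipeline samples each next token from a probability distribution instead of always taking the most likely one. The toy sketch below illustrates that idea in plain Python. The word probabilities are invented for the illustration and do not come from GPT-2.

```python
import random

# Hypothetical next-word probabilities; a real model computes these from the prompt.
NEXT_WORD_PROBS = {"continue": 0.4, "get": 0.3, "give": 0.2, "have": 0.1}


def sample_next_word(rng):
    """Draw one word at random, weighted by its probability."""
    words = list(NEXT_WORD_PROBS)
    weights = list(NEXT_WORD_PROBS.values())
    return rng.choices(words, weights=weights, k=1)[0]


def greedy_next_word():
    """Always return the most probable word (deterministic)."""
    return max(NEXT_WORD_PROBS, key=NEXT_WORD_PROBS.get)


# Two generators seeded identically produce identical draws.
rng_a = random.Random(42)
rng_b = random.Random(42)
draws_a = [sample_next_word(rng_a) for _ in range(5)]
draws_b = [sample_next_word(rng_b) for _ in range(5)]

print(draws_a)
print(draws_a == draws_b)
print(greedy_next_word())
```

Fixing the seed makes sampled output repeatable. The `transformers` library offers a `set_seed` helper that could be called before generation for the same purpose.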


## Text translation

In the previous exercise we saw an example of text translation, that is, translating text from one language into another.
In this exercise we will see that the same task can be carried out with a transformer-based model, with better performance and in a simpler way.
However, supporting additional languages and options would require changes.

In this example we translate from English to German.

The default model in this case is T5.


In [19]:
translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]


We will also translate the same sentence into French.

In [20]:
translator = pipeline("translation_en_to_fr")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


[{'translation_text': 'Hugging Face est une entreprise technologique basée à New York et à Paris.'}]
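The two task names used above follow the pattern `translation_<source>_to_<target>`. The sketch below builds such names with a hypothetical helper, `translation_task`. The set of supported pairs is an assumption based on the language pairs T5 is commonly described as trained on (English to German, French and Romanian) and should be checked against the model card.

```python
# Assumed language pairs for the default t5-base checkpoint (verify on the model card).
T5_PAIRS = {("en", "de"), ("en", "fr"), ("en", "ro")}


def translation_task(source, target):
    """Build a pipeline task name such as 'translation_en_to_de'."""
    if (source, target) not in T5_PAIRS:
        raise ValueError(f"No default model assumed for {source} -> {target}")
    return f"translation_{source}_to_{target}"


print(translation_task("en", "de"))
print(translation_task("en", "fr"))
```

For any other language pair, a dedicated translation checkpoint would have to be passed to the pipeline through its `model` argument, which is the kind of change mentioned at the start of this section.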
