In [3]:
!pip install transformers



In [4]:
import transformers

# Question Answering

* Easy to run with standard tools
* Models default to English
* But you can find many more at the HuggingFace model repository

In [2]:
# This will load a default English model trained for Question Answering
pipe = transformers.pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu


In [3]:
context = """Forbes has ranked Finland as the world's sixth best country for business in its annual list,
the Best Countries for Business. Finland is the best country in the world in terms of individual and property rights,
the  second best in terms of the innovation landscape and
the third best in terms of corruption, the American business magazine lists."""

pipe(question="What is Finland's rank in terms of corruption?", context=context)

{'score': 0.5244104266166687, 'start': 273, 'end': 278, 'answer': 'third'}
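The `start` and `end` fields in the result are character offsets into `context`, so the answer can be recovered by slicing. The sketch below illustrates this with a hand-built result dict that only mimics the shape of the output above; the short `context`, the `result` values and the helper `answer_with_window` are made up for illustration and are not produced by the model.

```python
# Stand-in for a question-answering result; a real one would come from
# pipe(question=..., context=context). Values here are made up for illustration.
context = "Finland is the third best in terms of corruption."
answer = "third"
start = context.find(answer)
result = {"score": 0.5, "start": start, "end": start + len(answer), "answer": answer}


def answer_with_window(result, context, width=20):
    """Return the answer span with up to `width` characters of context on each side."""
    lo = max(0, result["start"] - width)
    hi = min(len(context), result["end"] + width)
    return context[lo:hi]


# Slicing the context with the offsets gives back the answer string.
span = context[result["start"]:result["end"]]
print(span)
print(answer_with_window(result, context))
```

Looking at a window around the span can help when judging whether a low-scoring answer is still sensible in context.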

# Question Answering tasks

## Task 1

* Pick a few short paragraphs from your favorite source
* Come up with a few questions with an answer in the text and test the QA model
* How well did the model fare?
* Was it easy to come up with non-trivial questions? Consider whether this is a good way to build a QA dataset.

In [31]:
qa_pipe = transformers.pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [22]:
p1 = """Serbia, officially the Republic of Serbia, is a landlocked country at the crossroads of Southeast and Central Europe, located in the Balkans and the Pannonian Plain.
It borders Hungary to the north, Romania to the northeast, Bulgaria to the southeast,
North Macedonia to the south, Croatia and Bosnia and Herzegovina to the west, and Montenegro to the southwest. Serbia claims a border with Albania through the disputed territory of Kosovo.
In 2008, representatives of the Assembly of Kosovo unilaterally declared independence, with mixed responses from the international community while Serbia continues to claim it as part of its own sovereign territory.
Serbia has about 6.6 million inhabitants, excluding Kosovo. Its capital Belgrade is also the largest city."""

p2 = """By the mid-16th century, the Ottomans annexed the entirety of modern-day Serbia; their rule was at times interrupted by the Habsburg Empire,
which began expanding towards Central Serbia from the end of the 17th century while maintaining a foothold in Vojvodina. In the early 19th century,
the Serbian Revolution established the nation-state as the region's first constitutional monarchy, which subsequently expanded its territory. In 1918, in the aftermath of World War I,
the Kingdom of Serbia united with the former Habsburg crownland of Vojvodina; later in the same year it joined with other South Slavic nations in the foundation of Yugoslavia, which existed in various political formations until the Yugoslav Wars of the 1990s."""

p3 = """Serbia is an upper-middle income economy and provides universal health care and free primary and secondary education to its citizens.
It is a unitary parliamentary constitutional republic, member of the UN, Council of Europe, OSCE, PfP, BSEC, CEFTA, and is acceding to the WTO.
Since 2014, the country has been negotiating its EU accession, with the possibility of joining the European Union by 2030. Serbia formally adheres to the policy of military neutrality."""

In [32]:
question1 = "What is the capital and largest city of Serbia?"
qa_pipe(question=question1, context=p1)

{'score': 0.9765921831130981, 'start': 732, 'end': 740, 'answer': 'Belgrade'}

In [33]:
question2 = "Which countries border Serbia?"
qa_pipe(question=question2, context=p1)

{'score': 0.4288192689418793,
 'start': 88,
 'end': 116,
 'answer': 'Southeast and Central Europe'}

In [34]:
question3 = "Which country borders Serbia to the northeast?"
qa_pipe(question=question3, context=p1)

{'score': 0.761277973651886, 'start': 200, 'end': 207, 'answer': 'Romania'}

In [35]:
question4 = "Why does Serbia claim a border with Albania?"
qa_pipe(question=question4, context=p1)

{'score': 0.41425397992134094,
 'start': 401,
 'end': 441,
 'answer': 'through the disputed territory of Kosovo'}

In [36]:
question5 = "What major empire annexed Serbia in the mid-16th century?"
qa_pipe(question=question5, context=p2)

{'score': 0.5047846436500549, 'start': 25, 'end': 37, 'answer': 'the Ottomans'}

In [37]:
question6 = "What major historical event led to the formation of Yugoslavia?"
qa_pipe(question=question6, context=p2)

{'score': 0.3304097652435303,
 'start': 704,
 'end': 721,
 'answer': 'the Yugoslav Wars'}

In [38]:
question7 = "What is Serbia’s political system?"
qa_pipe(question=question7, context=p3)

{'score': 0.298593670129776,
 'start': 143,
 'end': 188,
 'answer': 'unitary parliamentary constitutional republic'}

In [39]:
question8 = "When did Serbia begin negotiating its EU accession?"
qa_pipe(question=question8, context=p3)

{'score': 0.9015503525733948, 'start': 286, 'end': 290, 'answer': '2014'}

In [40]:
question9 = "What are some international organizations Serbia is a member of?"
qa_pipe(question=question9, context=p3)

{'score': 0.00608057202771306,
 'start': 204,
 'end': 242,
 'answer': 'UN, Council of Europe, OSCE, PfP, BSEC'}

The model performed reasonably well, although the scores for some correct answers are low (about 0.30 for the political system, and 0.006 for the partial list of organizations). It failed on the countries bordering Serbia, returning "Southeast and Central Europe", and on the historical event that led to the formation of Yugoslavia, returning "the Yugoslav Wars". The question of why Serbia claims a border with Albania was not answered correctly at all: the returned span, "through the disputed territory of Kosovo", explains how rather than why. This question required more reasoning than the others.

It was easy to come up with questions of varying difficulty, which is important for training robust models. This is a good approach to building a QA dataset, because a human oversees the answers and can write deeply contextual questions.
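To put a number on "how well did the model fare", one option is the exact-match and token-level F1 scoring commonly used for SQuAD-style QA. The following is a minimal sketch of that idea, not the official evaluation script. The predictions are copied from the cells above; the gold answers are hypothetical reference answers written for illustration, and the function names are my own.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """True if prediction and gold are identical after normalization."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# (prediction from the notebook, hypothetical gold answer)
examples = [
    ("Belgrade", "Belgrade"),
    ("the Ottomans", "Ottomans"),
    ("the Yugoslav Wars", "World War I"),
    ("UN, Council of Europe, OSCE, PfP, BSEC",
     "UN, Council of Europe, OSCE, PfP, BSEC, CEFTA"),
]
for pred, gold in examples:
    print(f"{pred!r}: EM={exact_match(pred, gold)}, F1={token_f1(pred, gold):.2f}")
```

Token F1 gives partial credit to the truncated list of organizations, which exact match alone would count as a plain miss.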

## Task 2

* If you can access ChatGPT, ask it the same questions. Did it succeed?

In [50]:
chatgpt_qa_pipe = transformers.pipeline("text-generation", model="openai-community/gpt2-large")

Device set to use cpu


In [77]:
response = chatgpt_qa_pipe(f"Answer the following question based on the provided context:\n\nContext: {p1}\nQuestion: {question1}\nAnswer:", max_new_tokens=10)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the following question based on the provided context:

Context: Serbia, officially the Republic of Serbia, is a landlocked country at the crossroads of Southeast and Central Europe, located in the Balkans and the Pannonian Plain. 
It borders Hungary to the north, Romania to the northeast, Bulgaria to the southeast, 
North Macedonia to the south, Croatia and Bosnia and Herzegovina to the west, and Montenegro to the southwest. Serbia claims a border with Albania through the disputed territory of Kosovo. 
In 2008, representatives of the Assembly of Kosovo unilaterally declared independence, with mixed responses from the international community while Serbia continues to claim it as part of its own sovereign territory.
Serbia has about 6.6 million inhabitants, excluding Kosovo. Its capital Belgrade is also the largest city.
Question: What is the capital and largest city of Serbia?
Answer: Belgrade. 
Question: What is the


In [81]:
response = chatgpt_qa_pipe(f"Answer the following question based on the provided context:\n\nContext: {p1}\nQuestion: {question2}\nAnswer:", max_new_tokens=30)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the following question based on the provided context:

Context: Serbia, officially the Republic of Serbia, is a landlocked country at the crossroads of Southeast and Central Europe, located in the Balkans and the Pannonian Plain. 
It borders Hungary to the north, Romania to the northeast, Bulgaria to the southeast, 
North Macedonia to the south, Croatia and Bosnia and Herzegovina to the west, and Montenegro to the southwest. Serbia claims a border with Albania through the disputed territory of Kosovo. 
In 2008, representatives of the Assembly of Kosovo unilaterally declared independence, with mixed responses from the international community while Serbia continues to claim it as part of its own sovereign territory.
Serbia has about 6.6 million inhabitants, excluding Kosovo. Its capital Belgrade is also the largest city.
Question: Which countries border Serbia?
Answer: Albania, Bosnia and Herzegovina, Bulgaria, Croatia, Montenegro, Macedonia, and Serbia.
Question: In which border 

In [88]:
response = chatgpt_qa_pipe(f"Answer the following question based on the provided context:\n\nContext: {p1}\nQuestion: {question3}\nAnswer:", max_new_tokens=15)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the following question based on the provided context:

Context: Serbia, officially the Republic of Serbia, is a landlocked country at the crossroads of Southeast and Central Europe, located in the Balkans and the Pannonian Plain. 
It borders Hungary to the north, Romania to the northeast, Bulgaria to the southeast, 
North Macedonia to the south, Croatia and Bosnia and Herzegovina to the west, and Montenegro to the southwest. Serbia claims a border with Albania through the disputed territory of Kosovo. 
In 2008, representatives of the Assembly of Kosovo unilaterally declared independence, with mixed responses from the international community while Serbia continues to claim it as part of its own sovereign territory.
Serbia has about 6.6 million inhabitants, excluding Kosovo. Its capital Belgrade is also the largest city.
Question: Which country borders Serbia to the northeast?
Answer: Montenegro.
Question: Which country borders Serbia to the northeast?



In [71]:
response = chatgpt_qa_pipe(f"Answer the following question based on the provided context:\n\nContext: {p1}\nQuestion: {question4}\nAnswer:", max_new_tokens=40)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the following question based on the provided context:

Context: Serbia, officially the Republic of Serbia, is a landlocked country at the crossroads of Southeast and Central Europe, located in the Balkans and the Pannonian Plain. 
It borders Hungary to the north, Romania to the northeast, Bulgaria to the southeast, 
North Macedonia to the south, Croatia and Bosnia and Herzegovina to the west, and Montenegro to the southwest. Serbia claims a border with Albania through the disputed territory of Kosovo. 
In 2008, representatives of the Assembly of Kosovo unilaterally declared independence, with mixed responses from the international community while Serbia continues to claim it as part of its own sovereign territory.
Serbia has about 6.6 million inhabitants, excluding Kosovo. Its capital Belgrade is also the largest city.
Question: Why does Serbia claim a border with Albania?
Answer: The majority of Serbs do not support the independence of Kosovo
In fact, Serbs believe that Kosovo,

In [74]:
response = chatgpt_qa_pipe(f"Answer the following question based on the provided context:\n\nContext: {p2}\nQuestion: {question5}\nAnswer:", max_new_tokens=15)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the following question based on the provided context:

Context: By the mid-16th century, the Ottomans annexed the entirety of modern-day Serbia; their rule was at times interrupted by the Habsburg Empire, 
which began expanding towards Central Serbia from the end of the 17th century while maintaining a foothold in Vojvodina. In the early 19th century, 
the Serbian Revolution established the nation-state as the region's first constitutional monarchy, which subsequently expanded its territory. In 1918, in the aftermath of World War I, 
the Kingdom of Serbia united with the former Habsburg crownland of Vojvodina; later in the same year it joined with other South Slavic nations in the foundation of Yugoslavia, which existed in various political formations until the Yugoslav Wars of the 1990s.
Question: What major empire annexed Serbia in the mid-16th century?
Answer: The Ottoman Empire was one of the dominant political and military powers in central Europe


In [89]:
response = chatgpt_qa_pipe(f"Answer the following question based on the provided context:\n\nContext: {p2}\nQuestion: {question6}\nAnswer:", max_new_tokens=15)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the following question based on the provided context:

Context: By the mid-16th century, the Ottomans annexed the entirety of modern-day Serbia; their rule was at times interrupted by the Habsburg Empire, 
which began expanding towards Central Serbia from the end of the 17th century while maintaining a foothold in Vojvodina. In the early 19th century, 
the Serbian Revolution established the nation-state as the region's first constitutional monarchy, which subsequently expanded its territory. In 1918, in the aftermath of World War I, 
the Kingdom of Serbia united with the former Habsburg crownland of Vojvodina; later in the same year it joined with other South Slavic nations in the foundation of Yugoslavia, which existed in various political formations until the Yugoslav Wars of the 1990s.
Question: What major historical event led to the formation of Yugoslavia?
Answer: The outbreak of the First World War marked the most significant period of Serbian history


In [96]:
response = chatgpt_qa_pipe(f"Answer the following question based on the provided context:\n\nContext: {p3}\nQuestion: {question7}\nAnswer:", max_new_tokens=25)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the following question based on the provided context:

Context: Serbia is an upper-middle income economy and provides universal health care and free primary and secondary education to its citizens. 
It is a unitary parliamentary constitutional republic, member of the UN, Council of Europe, OSCE, PfP, BSEC, CEFTA, and is acceding to the WTO. 
Since 2014, the country has been negotiating its EU accession, with the possibility of joining the European Union by 2030. Serbia formally adheres to the policy of military neutrality.
Question: What is Serbia’s political system?
Answer: Serbia is a parliamentary republic based on a multiparty electoral system which has the power to elect the President, the Prime Minister,


In [97]:
response = chatgpt_qa_pipe(f"Answer the following question based on the provided context:\n\nContext: {p3}\nQuestion: {question8}\nAnswer:", max_new_tokens=25)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the following question based on the provided context:

Context: Serbia is an upper-middle income economy and provides universal health care and free primary and secondary education to its citizens. 
It is a unitary parliamentary constitutional republic, member of the UN, Council of Europe, OSCE, PfP, BSEC, CEFTA, and is acceding to the WTO. 
Since 2014, the country has been negotiating its EU accession, with the possibility of joining the European Union by 2030. Serbia formally adheres to the policy of military neutrality.
Question: When did Serbia begin negotiating its EU accession?
Answer: In December 2014, European Council approved the EU accession roadmap.
Question: What are the most important issues in the negotiations


In [104]:
response = chatgpt_qa_pipe(f"Answer the following question based on the provided context:\n\nContext: {p3}\nQuestion: {question9}\nAnswer:", max_new_tokens=40)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the following question based on the provided context:

Context: Serbia is an upper-middle income economy and provides universal health care and free primary and secondary education to its citizens. 
It is a unitary parliamentary constitutional republic, member of the UN, Council of Europe, OSCE, PfP, BSEC, CEFTA, and is acceding to the WTO. 
Since 2014, the country has been negotiating its EU accession, with the possibility of joining the European Union by 2030. Serbia formally adheres to the policy of military neutrality.
Question: What are some international organizations Serbia is a member of?
Answer:  UN:  Units of the UN are part of the United Nations system [EU].  As a member of the UN we are subject to the rules and regulations published in the UN Charter


From what I have seen, the OpenAI API no longer offers free models, so I used an open-source model from Hugging Face (`openai-community/gpt2-large`) instead. It is not the best at simple fact lookups (for example, it answered "Montenegro" for the northeastern neighbour); however, it seems to reason better than the extractive QA model, meaning it is able to attempt some of the more complex questions.
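The generated text above echoes the whole prompt and then drifts into inventing a new question, which makes the answers hard to read. A small post-processing step can help. The sketch below rebuilds the prompt layout used in the cells above and keeps only the first line of the continuation; `build_prompt` and `first_answer` are hypothetical helper names, and the sample continuation is copied from the first output.

```python
def build_prompt(context, question):
    """Same prompt layout as the text-generation cells above."""
    return (
        "Answer the following question based on the provided context:\n\n"
        f"Context: {context}\nQuestion: {question}\nAnswer:"
    )


def first_answer(generated_text, prompt):
    """Drop the echoed prompt and keep only the first line of the continuation."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    lines = continuation.strip().splitlines()
    return lines[0].strip() if lines else ""


prompt = build_prompt(
    "Its capital Belgrade is also the largest city.",
    "What is the capital and largest city of Serbia?",
)
# Continuation copied from the first generation output above.
generated = prompt + " Belgrade. \nQuestion: What is the"
print(first_answer(generated, prompt))
```

The text-generation pipeline also accepts `return_full_text=False`, which should make it return only the continuation, so the prompt does not need to be stripped by hand.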

## Task 3

* Explore the Hugging Face model repository. https://huggingface.co/models
* Try some of the QA models for a language of your interest. Can you find any? Below you can see a model for Finnish.

In [118]:
p1_serbian = """Srbija, zvanično Republika Srbija, je zemlja bez izlaza na more na raskrsnici jugoistočne i centralne Evrope, koja se nalazi na Balkanu i Panonskoj niziji.
Graniči se sa Mađarskom na severu, Rumunijom na severoistoku, Bugarskom na jugoistoku,
Severnom Makedonijom na jugu, Hrvatskom i Bosnom i Hercegovinom na zapadu, i Crnom Gorom na jugozapadu. Srbija tvrdi da ima granicu sa Albanijom preko sporne teritorije Kosova.
Godine 2008, predstavnici Skupštine Kosova jednostrano su proglasili nezavisnost, uz podeljene reakcije međunarodne zajednice, dok Srbija i dalje smatra Kosovo delom svoje suverene teritorije.
Srbija ima oko 6,6 miliona stanovnika, ne računajući Kosovo. Njen glavni grad Beograd je ujedno i najveći grad."""

p2_serbian = """Do sredine 16. veka, Osmanlije su aneksirale celu teritoriju današnje Srbije; njihova vlast je povremeno bila prekidana od strane Habsburške monarhije,
koja je krajem 17. veka počela da se širi prema centralnoj Srbiji, dok je istovremeno zadržavala uporište u Vojvodini. Početkom 19. veka,
Srpska revolucija uspostavila je nacionalnu državu kao prvu ustavnu monarhiju u regionu, koja je kasnije proširila svoju teritoriju. Godine 1918, nakon Prvog svetskog rata,
Kraljevina Srbija se ujedinila sa bivšom habsburškom krunskom zemljom Vojvodinom; kasnije iste godine pridružila se i drugim južnoslovenskim narodima u osnivanju Jugoslavije, koja je postojala u različitim političkim formacijama sve do jugoslovenskih ratova 1990-ih godina."""

p3_serbian = """Srbija je privreda srednje-višeg dohotka i svojim građanima obezbeđuje univerzalnu zdravstvenu zaštitu i besplatno osnovno i srednje obrazovanje.
To je unitarna parlamentarna ustavna republika, članica UN, Saveta Evrope, OEBS-a, PfP-a, BSEC-a, CEFTA-e, i u procesu je pridruživanja STO-u.
Od 2014. godine, zemlja pregovara o svom pridruživanju Evropskoj uniji, sa mogućnošću pristupanja do 2030. godine. Srbija se formalno pridržava politike vojne neutralnosti."""

In [106]:
# Multilingual BERT fine-tuned on Polish SQuAD, applied here to Serbian text
qa_pipe_serbian = transformers.pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-polish-squad1",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-polish-squad1"
)

Some weights of the model checkpoint at henryk/bert-base-multilingual-cased-finetuned-polish-squad1 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


In [112]:
qa_pipe_serbian(question="Koji je glavni grad Srbije?", context=p1_serbian)

{'score': 0.9109963774681091, 'start': 695, 'end': 702, 'answer': 'Beograd'}

In [114]:
qa_pipe_serbian(question="Koje zemlje se granice sa Srbijom?", context=p1_serbian)

{'score': 0.026008082553744316,
 'start': 157,
 'end': 180,
 'answer': 'Graniči se sa Mađarskom'}

In [116]:
qa_pipe_serbian(question="Koje zemlja se granici sa Srbijom na jugu?", context=p1_serbian)

{'score': 0.1310046762228012, 'start': 0, 'end': 6, 'answer': 'Srbija'}

In [117]:
qa_pipe_serbian(question="Zašto Srbija tvrdi da ima granicu sa Albanijom?", context=p1_serbian)

{'score': 0.8857381343841553,
 'start': 390,
 'end': 420,
 'answer': 'preko sporne teritorije Kosova'}

In [119]:
qa_pipe_serbian(question="Koje veliko carstvo je aneksiralo Srbiju sredinom 16. veka?", context=p2_serbian)

{'score': 0.9800374507904053, 'start': 21, 'end': 30, 'answer': 'Osmanlije'}

In [120]:
qa_pipe_serbian(question="Koji veliki istorijski događaj je doveo do formiranja Jugoslavije?", context=p2_serbian)

{'score': 0.6147505640983582,
 'start': 292,
 'end': 309,
 'answer': 'Srpska revolucija'}

In [121]:
qa_pipe_serbian(question="Kakav je politički sistem Srbije?", context=p3_serbian)

{'score': 0.0916227176785469,
 'start': 444,
 'end': 463,
 'answer': 'vojne neutralnosti.'}

In [122]:
qa_pipe_serbian(question="Kada je Srbija počela pregovore o pridruživanju EU?", context=p3_serbian)

{'score': 0.9228823781013489, 'start': 294, 'end': 298, 'answer': '2014'}

In [123]:
qa_pipe_serbian(question="Članica kojih međunarodnih organizacija je Srbija?", context=p3_serbian)

{'score': 0.16740962862968445, 'start': 203, 'end': 205, 'answer': 'UN'}

## Finnish

In [None]:
qa_pipe_finnish = transformers.pipeline("question-answering", model="TurkuNLP/bert-base-finnish-cased-squad2")

In [None]:
context="""PIK-15 Hinu (lyh. sanasta Hinauskone) on suomalainen puurakenteinen yksimoottorinen purjelentokoneiden hinauskone. Hinua lähdettiin suunnittelemaan ulkomaisten hinauskoneiden hankinnan osoittauduttua hankalaksi tuontilisenssiongelmien takia. Polyteknikkojen ilmailukerho (PIK) järjesti konetyypin suunnittelusta kilpailun, jonka voitti Kai Mellénin konesuunnitelma. Mellén valittiin lopulta myös Hinun pääsuunnittelijaksi, ja tyypin suunnittelutyöt aloitettiin syksyllä 1960. Prototyypin rakennustyöt käynnistyivät keväällä 1962.

Prototyyppi valmistui ja sen ensilento suoritettiin 24. lokakuuta 1964. Pitkässä koelento-ohjelmassa konetyypin suurimmaksi ongelmaksi paljastui sen syöksykierreominaisuudet, joihin saatiin pientä parannusta jatkamalla runkoa. Hinuja valmistui 1970-luvun loppuun mennessä kuusi koneyksilöä, minkä lisäksi yksittäisiä koneita on valmistunut myöhemminkin.[2] Suomen ilma-alusrekisterissä oli kahdeksan koneyksilöä vuonna 2015.[3]
"""
qa_pipe_finnish(question="Miten PIK-15 suurin ongelma korjattiin?", context=context)

{'score': 0.8015400767326355,
 'start': 739,
 'end': 756,
 'answer': 'jatkamalla runkoa'}

# Summarization

* text2text problem

In [5]:
sum_pipe = transformers.pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [6]:
context="""
Raphael Jondot, originally from France, told Yle that he would like to settle down in eastern Finland's region of North Karelia. The reason behind his choice was a little surprising.

"I like the people and the food here. Karelian food is really tasty," Jondot said, assuring Yle that he was serious.

Jondot has been looking for work as a maintenance technician or maintenance engineer since last autumn.

A report published last week by the think tank Etla Economic Research Institute suggested that Finland needs to triple its immigration to maintain the country's labour force.

However, many foreigners coming to Finland still face significant hurdles. The Finnish Immigration Service (Migri) has been plagued by backlogs for years, despite making efforts at reforms over the past year. An October survey commissioned by the nonprofit group E2 Research also found that two-out-of-five foreign specialists in Finland face discrimination.
"""

sum_pipe(context)
# Note: you can also limit the summary length:
# sum_pipe(context, min_length=15, max_length=30)

[{'summary_text': ' Raphael Jondot, originally from France, wants to settle in eastern Finland\'s North Karelia region . He says he likes the people and the food in the region, and Karelian food is "really tasty" Many foreigners coming to Finland still face significant hurdles, including discrimination .'}]

# Summarization tasks

## Task 1

* Pick a text of your liking, and test the summarization pipeline below on it. Does it do a decent job?

In [7]:
sum_pipe = transformers.pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [35]:
context = """
The Rise of Artificial Intelligence in Healthcare

Artificial intelligence (AI) has increasingly become a crucial part of the healthcare industry, revolutionizing diagnostics, treatment planning, and patient care. AI-powered tools assist doctors in analyzing medical images, detecting diseases such as cancer at an early stage, and predicting patient outcomes more accurately than traditional methods. Machine learning algorithms can process vast amounts of medical data, helping researchers develop new drugs and personalized treatment plans.

One of the most significant advancements is the use of AI in medical imaging. Deep learning models trained on thousands of X-rays, MRIs, and CT scans can identify abnormalities with high precision, sometimes even outperforming human radiologists. AI is also transforming electronic health records by streamlining administrative tasks, reducing paperwork, and allowing doctors to focus more on patient care.

Despite its benefits, AI in healthcare also presents challenges. Concerns about data privacy, biases in AI models, and the need for regulatory approvals can slow down adoption. Additionally, while AI can enhance medical decision-making, it is not meant to replace doctors but rather to support them in providing better patient outcomes.

As technology advances, AI’s role in healthcare is expected to grow, leading to improved efficiency, reduced costs, and better access to quality medical services worldwide. However, careful implementation and ethical considerations will be essential in ensuring that AI benefits all patients fairly and safely.
"""

context_serbian = """
Veštačka inteligencija (VI) sve više postaje ključni deo medicinske industrije, revolucionizujući dijagnostiku, planiranje terapije i brigu o pacijentima. Alati zasnovani na VI pomažu lekarima u analizi medicinskih snimaka, ranom otkrivanju bolesti poput raka i preciznijem predviđanju ishoda lečenja u poređenju sa tradicionalnim metodama. Algoritmi mašinskog učenja mogu obraditi ogromne količine medicinskih podataka, pomažući istraživačima u razvoju novih lekova i personalizovanih terapija.

Jedan od najznačajnijih napredaka je primena VI u medicinskom snimanjima. Modeli dubokog učenja, obučeni na hiljadama rendgenskih, MR i CT snimaka, mogu sa velikom preciznošću identifikovati abnormalnosti, ponekad čak i bolje od ljudskih radiologa. VI takođe menja način vođenja elektronskih zdravstvenih kartona, automatizujući administrativne zadatke, smanjujući papirologiju i omogućavajući lekarima da više vremena posvete pacijentima.

Iako donosi mnoge koristi, primena VI u medicini suočava se i sa izazovima. Pitanja privatnosti podataka, pristrasnosti u modelima VI i potreba za regulatornim odobrenjima mogu usporiti njeno usvajanje. Pored toga, iako VI može poboljšati medicinske odluke, njen cilj nije da zameni lekare, već da ih podrži u pružanju kvalitetnije nege pacijentima.

Kako tehnologija napreduje, očekuje se da će uloga VI u medicini nastaviti da raste, poboljšavajući efikasnost, smanjujući troškove i omogućavajući bolji pristup kvalitetnim medicinskim uslugama širom sveta. Ipak, neophodna je pažljiva implementacija i etički pristup kako bi se osiguralo da VI koristi svim pacijentima na pravičan i siguran način.
"""

In [8]:
sum_pipe(context)

[{'summary_text': ' Artificial Intelligence has become a crucial part of the healthcare industry . Machine learning algorithms can process vast amounts of medical data, helping researchers develop new drugs and personalized treatment plans . AI is also transforming electronic health records by streamlining administrative tasks . Concerns about data privacy and biases in AI models can slow down adoption .'}]

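It does a good job of summarizing the content: the summary covers AI's growing role, data processing for drug development, electronic health records, and the concerns slowing adoption, although it leaves out medical imaging and the closing outlook.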

## Task 2

* The pipeline accepts `min_length` and `max_length` parameters when you run it. Try adjusting these to various lengths and see whether the model comes up with a reasonable summary.

In [21]:
sum_pipe(context, min_length=10, max_length=50)

[{'summary_text': ' Artificial Intelligence has become a crucial part of the healthcare industry . Machine learning algorithms can process vast amounts of medical data . AI is also transforming electronic health records by streamlining administrative tasks .'}]

In [24]:
sum_pipe(context, min_length=50, max_length=100)

[{'summary_text': ' Artificial Intelligence has become a crucial part of the healthcare industry . Machine learning algorithms can process vast amounts of medical data . AI is also transforming electronic health records by streamlining administrative tasks . Concerns about data privacy and biases in AI models can slow down adoption .'}]

In [23]:
sum_pipe(context, min_length=100, max_length=200)

[{'summary_text': ' Artificial Intelligence has become a crucial part of the healthcare industry . Machine learning algorithms can process vast amounts of medical data, helping researchers develop new drugs and personalized treatment plans . AI is also transforming electronic health records by streamlining administrative tasks . Concerns about data privacy and biases in AI models can slow down adoption of the technology . As technology advances, AI’s role in AI is expected to grow, leading to improved efficiency, reduced costs, and better access to quality medical services worldwide . However, careful implementation and ethical considerations will be essential in ensuring that AI benefits all patients fairly and safely .'}]
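The three runs above can be collected into one comparison. The following is a minimal sketch, not part of the original notebook: `sweep_lengths` is a hypothetical helper, and it assumes the summarizer is a callable that returns a list of dicts with a `summary_text` key, as `sum_pipe` does above. Note that `min_length` and `max_length` count tokens, so the word count reported here is only a rough proxy.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def sweep_lengths(
    summarize: Callable[..., list],
    text: str,
    settings: Iterable[Tuple[int, int]],
) -> List[Dict]:
    """Run `summarize` once per (min_length, max_length) pair and collect the results."""
    rows = []
    for min_len, max_len in settings:
        if min_len > max_len:
            raise ValueError(f"min_length {min_len} exceeds max_length {max_len}")
        result = summarize(text, min_length=min_len, max_length=max_len)
        summary = result[0]["summary_text"].strip()
        rows.append(
            {
                "min_length": min_len,
                "max_length": max_len,
                "words": len(summary.split()),  # rough proxy; the limits are in tokens
                "summary": summary,
            }
        )
    return rows
```

With the pipeline from this notebook, a call such as `sweep_lengths(sum_pipe, context, [(10, 50), (50, 100), (100, 200)])` should reproduce the three cells above in a single list, which makes it easier to see where the summary grows and where it stays the same.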

## Task 3

* Explore the Hugging Face models repository and experiment with summarization models for other languages of your choice. Do they exist? Do they work?

In [37]:
sum_pipe_multi = transformers.pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")

Device set to use cpu


In [36]:
sum_pipe_multi(context_serbian, min_length=100, max_length=200)

[{'summary_text': 'Uloga veštačke inteligencije u medicini će nastaviti da napreduje, pokazuje studija Ujedinjenih nacija za medicinu koja se bavi proizvodnjom i analizom medicinskih podataka, a naučnici kažu da je cilj njenog učesnika da poboljša efikasnost medicinske nege pacijen BBC.. . (Ovaj članak sadrži neke od najvažnijih cilja ove tehnologije).'}]

In [43]:
sum_pipe_german = transformers.pipeline("summarization", model="Shahm/bart-german")

Device set to use cpu


In [46]:
context_german = """
Künstliche Intelligenz (KI) wird zunehmend zu einem entscheidenden Bestandteil der Medizinindustrie und revolutioniert die Diagnose, Therapieplanung und Patientenversorgung. KI-gestützte Werkzeuge helfen Ärzten bei der Analyse medizinischer Bilder, der Früherkennung von Krankheiten wie Krebs und der genaueren Vorhersage von Behandlungsergebnissen im Vergleich zu traditionellen Methoden. Maschinelle Lernalgorithmen können enorme Mengen an medizinischen Daten verarbeiten und Forschern bei der Entwicklung neuer Medikamente und personalisierter Therapien helfen.

Einer der bedeutendsten Fortschritte ist der Einsatz von KI in der medizinischen Bildgebung. Tiefenlernmodelle, die mit Tausenden von Röntgen-, MRT- und CT-Bildern trainiert wurden, können Anomalien mit hoher Präzision identifizieren – manchmal sogar besser als menschliche Radiologen. KI verändert auch die Verwaltung elektronischer Patientenakten, automatisiert administrative Aufgaben, reduziert den Papieraufwand und ermöglicht es Ärzten, mehr Zeit mit Patienten zu verbringen.

Trotz der vielen Vorteile steht der Einsatz von KI in der Medizin vor Herausforderungen. Fragen des Datenschutzes, Verzerrungen in KI-Modellen und die Notwendigkeit regulatorischer Genehmigungen können ihre Einführung verlangsamen. Zudem ist KI nicht dazu gedacht, Ärzte zu ersetzen, sondern sie in ihrer Arbeit zu unterstützen und die Qualität der Patientenversorgung zu verbessern.

Mit dem technologischen Fortschritt wird erwartet, dass die Rolle der KI in der Medizin weiter wächst, die Effizienz steigert, Kosten senkt und den Zugang zu hochwertigen Gesundheitsdiensten weltweit verbessert. Dennoch ist eine sorgfältige Implementierung und ein ethischer Ansatz erforderlich, um sicherzustellen, dass KI allen Patienten auf faire und sichere Weise zugutekommt.
"""

sum_pipe_german(context_german, min_length=100, max_length=200)

[{'summary_text': 'Künstliche Intelligenz (KI) wird zunehmend zu einem entscheidenden Bestandteil der Medizinindustrie und revolutioniert die Diagnose, Therapieplanung und Patientenversorgung. KI-gestützte Werkzeuge helfen Ärzten bei der Analyse medizinischer Bilder, der Früherkennung von Krankheiten wie Krebs und der genaueren Vorhersage von Behandlungsergebnissen im Verg'}]

In [48]:
sum_pipe_german(context_serbian, min_length=100, max_length=200)

[{'summary_text': 'Alati zasnovani na VI pomažu lekarima u analizi medicinskih snimaka, ranom otkrivanju bolesti poput raka i preciznijem predviđanju ishoda lečenja u poređenju sa tradicionalnim metodama. Algoritmi mašinskog učenj mogu obraditi ogromne količine medicinskihl podataka, pomaß�ući istraživačima u razvo'}]

Models that work with Serbian (my native language) are not easy to find. They are usually part of a multilingual model, like the one used in this task. The model works in the sense that the summary is a complete thought and makes sense, but it is not as detailed or as accurate as the English version. It also refers to a study and an article that do not appear in the input, which probably comes from the model's generative behavior or from the corpus it was trained on.

The German model worked better and effectively summarized the German text. Surprisingly, it also did better on the Serbian text.
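One way to check whether a summary condenses the input or mostly copies it is to measure how many of its word n-grams appear verbatim in the source. This is relevant here because the German summary above repeats the opening sentences of its input almost word for word. The sketch below is an assumption-laden illustration, not a standard metric: `ngram_copy_ratio` is a hypothetical helper, it splits on whitespace only, and the choice of trigrams is arbitrary.

```python
def ngram_copy_ratio(source: str, summary: str, n: int = 3) -> float:
    """Fraction of the summary's word n-grams that occur verbatim in the source."""

    def ngrams(words):
        return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]

    source_ngrams = set(ngrams(source.split()))
    summary_ngrams = ngrams(summary.split())
    if not summary_ngrams:
        return 0.0  # summary shorter than n words
    copied = sum(1 for gram in summary_ngrams if gram in source_ngrams)
    return copied / len(summary_ngrams)
```

A ratio close to 1 suggests the output is largely extracted from the source, while a low ratio suggests rephrasing or, in the worst case, invented content. For example, `ngram_copy_ratio(context_german, summary)` could be applied to the `summary_text` returned by `sum_pipe_german`.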

## Task 4

* If you can access ChatGPT, try to have it summarize a piece of text. Did it work as you expected? Were you able to control the length effectively? How about other languages?

In [12]:
# Note: this loads GPT-2 Large, not ChatGPT itself
chatgpt_sum_pipe = transformers.pipeline("text-generation", model="openai-community/gpt2-large")

Device set to use cpu


In [14]:
response = chatgpt_sum_pipe(f"Summarize the following text:\n\nText: {context}\nSummary:", max_new_tokens=50)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summarize the following text:

Text: 
The Rise of Artificial Intelligence in Healthcare

Artificial intelligence (AI) has increasingly become a crucial part of the healthcare industry, revolutionizing diagnostics, treatment planning, and patient care. AI-powered tools assist doctors in analyzing medical images, detecting diseases such as cancer at an early stage, and predicting patient outcomes more accurately than traditional methods. Machine learning algorithms can process vast amounts of medical data, helping researchers develop new drugs and personalized treatment plans.

One of the most significant advancements is the use of AI in medical imaging. Deep learning models trained on thousands of X-rays, MRIs, and CT scans can identify abnormalities with high precision, sometimes even outperforming human radiologists. AI is also transforming electronic health records by streamlining administrative tasks, reducing paperwork, and allowing doctors to focus more on patient care.

Despite i

In [15]:
response = chatgpt_sum_pipe(f"Summarize the following text:\n\nText: {context}\nSummary:", max_new_tokens=100)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summarize the following text:

Text: 
The Rise of Artificial Intelligence in Healthcare

Artificial intelligence (AI) has increasingly become a crucial part of the healthcare industry, revolutionizing diagnostics, treatment planning, and patient care. AI-powered tools assist doctors in analyzing medical images, detecting diseases such as cancer at an early stage, and predicting patient outcomes more accurately than traditional methods. Machine learning algorithms can process vast amounts of medical data, helping researchers develop new drugs and personalized treatment plans.

One of the most significant advancements is the use of AI in medical imaging. Deep learning models trained on thousands of X-rays, MRIs, and CT scans can identify abnormalities with high precision, sometimes even outperforming human radiologists. AI is also transforming electronic health records by streamlining administrative tasks, reducing paperwork, and allowing doctors to focus more on patient care.

Despite i

In [18]:
response = chatgpt_sum_pipe(f"Summarize the following text:\n\nText: {context}\nSummary:", max_new_tokens=200)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summarize the following text:

Text: 
The Rise of Artificial Intelligence in Healthcare

Artificial intelligence (AI) has increasingly become a crucial part of the healthcare industry, revolutionizing diagnostics, treatment planning, and patient care. AI-powered tools assist doctors in analyzing medical images, detecting diseases such as cancer at an early stage, and predicting patient outcomes more accurately than traditional methods. Machine learning algorithms can process vast amounts of medical data, helping researchers develop new drugs and personalized treatment plans.

One of the most significant advancements is the use of AI in medical imaging. Deep learning models trained on thousands of X-rays, MRIs, and CT scans can identify abnormalities with high precision, sometimes even outperforming human radiologists. AI is also transforming electronic health records by streamlining administrative tasks, reducing paperwork, and allowing doctors to focus more on patient care.

Despite i

In [26]:
response = chatgpt_sum_pipe(f"Summarize the following text and translate the summarization to German:\n\nText: {context}\nSummary:", max_new_tokens=100)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summarize the following text and translate the summarization to German:

Text: 
The Rise of Artificial Intelligence in Healthcare

Artificial intelligence (AI) has increasingly become a crucial part of the healthcare industry, revolutionizing diagnostics, treatment planning, and patient care. AI-powered tools assist doctors in analyzing medical images, detecting diseases such as cancer at an early stage, and predicting patient outcomes more accurately than traditional methods. Machine learning algorithms can process vast amounts of medical data, helping researchers develop new drugs and personalized treatment plans.

One of the most significant advancements is the use of AI in medical imaging. Deep learning models trained on thousands of X-rays, MRIs, and CT scans can identify abnormalities with high precision, sometimes even outperforming human radiologists. AI is also transforming electronic health records by streamlining administrative tasks, reducing paperwork, and allowing doctors

In [28]:
# Step 1: generate a summary and keep only the text after the "Summary:" marker
response = chatgpt_sum_pipe(f"Summarize the following text:\n\nText: {context}\nSummary:", max_new_tokens=100)

text = response[0]["generated_text"].split("Summary:")[1]

# Step 2: ask the same model to translate that summary to German
response = chatgpt_sum_pipe(f"Translate the following text to German:\n\nText: {text}\nTranslation:", max_new_tokens=100)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Translate the following text to German:

Text: 

The rise of AI in healthcare is not only contributing to the development of smarter systems that are more efficient and capable, but also with a greater understanding of the role of AI in the healthcare industry.

The advent of AI in healthcare brings an entirely new set of ethical considerations that will need to be considered by healthcare professionals. AI technology also presents significant challenges, including privacy concerns, biased models, and the need for regulatory approvals. There are also potential risks to patient safety and privacy and the
Translation: In the future, AI-based medical diagnostics will contribute more to preventative care and healthcare efficiency by improving diagnosis methods used to diagnose conditions. These methods will include using machine learning algorithms to identify relevant conditions and potentially make better informed choices.

The following AI based systems will be used to diagnose diabetes

In [31]:
response = chatgpt_sum_pipe(f"Sumiraj sledeci tekst na srpskom jeziku:\n\nTekst: {context_serbian}\nSumiran tekst:", max_new_tokens=100)

print(response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sumiraj sledeci tekst na srpskom jeziku:

Tekst: 
Veštačka inteligencija (VI) sve više postaje ključni deo medicinske industrije, revolucionizujući dijagnostiku, planiranje terapije i brigu o pacijentima. Alati zasnovani na VI pomažu lekarima u analizi medicinskih snimaka, ranom otkrivanju bolesti poput raka i preciznijem predviđanju ishoda lečenja u poređenju sa tradicionalnim metodama. Algoritmi mašinskog učenja mogu obraditi ogromne količine medicinskih podataka, pomažući istraživačima u razvoju novih lekova i personalizovanih terapija.

Jedan od najznačajnijih napredaka je primena VI u medicinskom snimanjima. Modeli dubokog učenja, obučeni na hiljadama rendgenskih, MR i CT snimaka, mogu sa velikom preciznošću identifikovati abnormalnosti, ponekad čak i bolje od ljudskih radiologa. VI takođe menja način vođenja elektronskih zdravstvenih kartona, automatizujući administrativne zadatke, smanjujući papirologiju i omogućavajući lekarima da više vremena posvete pacijentima.

Iako donosi 

The model used here, GPT-2 Large (`openai-community/gpt2-large`), stands in for ChatGPT. It is a general-purpose generative model, not one specialized for downstream tasks such as QA or summarization. In this case it generated new information that is not in the given context, so it did not perform as well as the summarization model in Tasks 1 and 2.

I couldn't get it to produce the summary in another language, either directly or through two separate prompts. I also couldn't get it to answer in the target language even when both the prompt and the text were in that language (it answered in some other language, not Serbian).

Length can be controlled through the `max_new_tokens` parameter, but unlike with the model in Tasks 1 and 2, the summary is not a complete thought. Setting the limit to 50 should ideally yield a complete summary within 50 tokens, but the model always stops in the middle of a sentence. In practice `max_new_tokens` only caps the output; it does not make the model plan a summary of that length.
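Since the generated text echoes the prompt and then stops mid-sentence, some post-processing can make the outputs easier to compare. The following is a minimal sketch with two hypothetical helpers, `strip_prompt` and `trim_to_last_sentence`; the sentence detection is a simple punctuation heuristic and will misfire on abbreviations or decimal numbers.

```python
import re


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the continuation when the output starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


def trim_to_last_sentence(text: str) -> str:
    """Cut `text` after its last sentence-ending punctuation mark.

    Returns an empty string when no complete sentence is found.
    """
    # A sentence end is '.', '!' or '?' followed by whitespace or end of text.
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return ""
    return text[: matches[-1].end()].strip()
```

Applied as `trim_to_last_sentence(strip_prompt(response[0]["generated_text"], prompt))`, this drops the dangling partial sentence. The text-generation pipeline also accepts `return_full_text=False`, which should return only the continuation and make `strip_prompt` unnecessary. Neither step makes the model summarize better; it only tidies what was generated.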