# Package installation

In [1]:
%pip install transformers tokenizers datasets accelerate evaluate scikit-learn -qU

Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install ipywidgets tqdm

Collecting ipywidgets
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting widgetsnbextension~=4.0.12 (from ipywidgets)
  Downloading widgetsnbextension-4.0.13-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab-widgets~=3.0.12 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.13-py3-none-any.whl.metadata (4.1 kB)
Downloading ipywidgets-8.1.5-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.13-py3-none-any.whl (214 kB)
Downloading widgetsnbextension-4.0.13-py3-none-any.whl (2.3 MB)
Installing collected packages: widgetsnbextension, jupyterlab-widgets, ipywidgets
Successfully installed ipywidgets-8.1.5 jupyterlab-widgets-3.0.13 widgetsnbextension-4.0.13
Note: you may need to restart the kernel to use updated packages.


In [1]:
import transformers
import datasets
import evaluate
import tokenizers
print(transformers.__version__)
print(datasets.__version__)
print(evaluate.__version__)
print(tokenizers.__version__)

4.46.3
3.1.0
0.4.3
0.20.3


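# Using models with the Hugging Face Pipeline

- A Pipeline is the most basic object in the Transformers library. It automates the whole sequence of **preprocessing -> inference -> postprocessing**, so a model can be used with very little code.
- The library provides a Pipeline class for each task, and the `pipeline` function creates the appropriate one.
- You can create a pipeline by **specifying only the task**, which loads a default model and tokenizer, or by **specifying the model and tokenizer yourself**.
- https://huggingface.co/docs/transformers/pipeline_tutorial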

![huggingface_pipeline.png](figures/huggingface_pipeline.png)
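
To make the postprocessing stage concrete, the sketch below turns raw model scores (logits) into the `label`/`score` dictionary that a text-classification pipeline returns. The logit values and the `postprocess` helper are hypothetical, and the use of softmax is an assumption about a single-label classifier rather than a copy of the library's code.

```python
import math

def postprocess(logits, id2label):
    """Convert raw logits into a {'label', 'score'} dict using softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for a two-class sentiment model.
result = postprocess([-4.2, 4.3], {0: "NEGATIVE", 1: "POSITIVE"})
print(result)
```

The pipeline hides this step, together with tokenization and the forward pass, behind a single call.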

## Main supported tasks
### Natural language processing tasks
- **text-classification**: text classification
- **text-generation**: text generation
- **translation**: translation
- **summarization**: summarization
- **question-answering**: question answering
- **fill-mask**: filling in masked tokens
- **ner**: named entity recognition
- **feature-extraction**: feature extraction (context vectors)

### Image processing tasks
- **image-classification**: image classification
- **object-detection**
  -  Object detection
  -  Finds the location and class of each object in an image.
- **image-segmentation**
  -  Image segmentation
  -  Partitions an image at the pixel level and classifies which object each pixel belongs to.

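## The pipeline function
- Main parameters
  - **task:** The type of task to perform, given as a string.
  - **model:**
    - The name or path of the pretrained model to use.
    - A model name (ID) has the form `[owner]/[model name]`. Some models hosted directly by Hugging Face are referenced without the owner part (e.g. "google/gemma-2-2b", "gpt2").
    - If no model is specified, **a default model for the task is loaded**.
  - **tokenizer:** The tokenizer to use. If omitted, the tokenizer distributed with the model (the one with the same name as the model) is used.
  - **framework:** The deep learning framework to use: 'pt' for PyTorch, 'tf' for TensorFlow. If omitted, the installed framework is used; when both are installed and no model is given, PyTorch is the default.
  - **device:** The device to run the model on. -1 means CPU and 0 means the first GPU.
  - **revision:** Selects a specific version of the model.
  - **use_fast:**
    - Whether to use a fast tokenizer. The default is True.
    - Fast tokenizers are implemented in `Rust` and run faster, but they are not available for every model. If a model has none, the regular tokenizer is used even with `use_fast=True`.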
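
As a rough illustration of how these parameters fit together, the sketch below collects them into keyword arguments for `pipeline`. The helpers `pipeline_kwargs` and `build_pipeline` are hypothetical names introduced here. The model ID and revision are the ones this notebook's log reports for the default text-classification model. The actual `pipeline` call is left commented out because it downloads the model.

```python
def pipeline_kwargs(task, model=None, revision=None, use_gpu=False, use_fast=True):
    """Collect keyword arguments for transformers.pipeline (hypothetical helper)."""
    kwargs = {
        "task": task,
        "framework": "pt",               # PyTorch
        "device": 0 if use_gpu else -1,  # 0 = first GPU, -1 = CPU
        "use_fast": use_fast,
    }
    if model is not None:
        kwargs["model"] = model
    if revision is not None:
        kwargs["revision"] = revision
    return kwargs

def build_pipeline(**kwargs):
    """Create the pipeline; transformers is imported only when this is called."""
    from transformers import pipeline
    return pipeline(**kwargs)

kwargs = pipeline_kwargs(
    "text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    revision="714eb0f",
)
print(kwargs)
# classifier = build_pipeline(**kwargs)  # downloads the model on first use
```

Pinning both `model` and `revision` addresses the library's own warning that relying on the default model is not recommended in production.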

## Model search
![huggingface_model_search.png](figures/huggingface_model_search.png)



### Text classification (sentiment analysis)

In [2]:
# Suppress warning output
import warnings
warnings.filterwarnings(action='ignore')

In [3]:
from transformers import pipeline
# If a ModuleNotFoundError for tf-keras is raised, pass framework="pt".
model = pipeline(task="text-classification", framework="pt")
# (Downloads the model on first use.) Load model and tokenizer -> tokenize -> run inference with the model -> post-process the result
res = model("It is not funny.")
print(res)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9997994303703308}]


In [4]:
type(model)

transformers.pipelines.text_classification.TextClassificationPipeline

In [6]:
data = [ 
    "The project was completed successfully.", 
    "She always brings positive energy to the team.", 
    "I am confident that we will achieve our goals.",
    "The results were not as expected.", 
    "He struggled to meet the deadline.", 
    "The client was dissatisfied with the final product." 
]

In [7]:
res = model(data)

In [9]:
from pprint import pprint
pprint(res)

[{'label': 'POSITIVE', 'score': 0.9998227953910828},
 {'label': 'POSITIVE', 'score': 0.9998812675476074},
 {'label': 'POSITIVE', 'score': 0.9998470544815063},
 {'label': 'NEGATIVE', 'score': 0.9978100657463074},
 {'label': 'NEGATIVE', 'score': 0.99960857629776},
 {'label': 'NEGATIVE', 'score': 0.9996129870414734}]


In [10]:
model="distilbert-base-uncased-finetuned-sst-2-english"
classifier = pipeline(task="text-classification", model=model, framework='pt')


In [11]:
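kor_texts = [
    "이 영화 정말 재미있어요!",  # "This movie is really fun!"
    "서비스가 별로였어요.",  # "The service was not good."
    "제품 품질이 우수합니다.",  # "The product quality is excellent."
    "따듯하고 부드럽고 제품은 너무 좋습니다. 그런데 배송이 너무 늦네요."  # "Warm, soft, a great product. But delivery is too slow." Ambiguous case: the score comes out around 0.56.
]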
classifier(kor_texts)

[{'label': 'POSITIVE', 'score': 0.9855567812919617},
 {'label': 'POSITIVE', 'score': 0.7425776124000549},
 {'label': 'POSITIVE', 'score': 0.6555716395378113},
 {'label': 'NEGATIVE', 'score': 0.5247918367385864}]

In [12]:
model = 'Copycats/koelectra-base-v3-generalized-sentiment-analysis' 
classifier = pipeline(task="text-classification", model=model)
res = classifier(kor_texts)


In [13]:
res

[{'label': '1', 'score': 0.9897311329841614},
 {'label': '0', 'score': 0.9969298243522644},
 {'label': '1', 'score': 0.9640172123908997},
 {'label': '0', 'score': 0.5669127702713013}]
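
This Korean model returns the labels `'0'` and `'1'` rather than readable names. The sketch below maps them to names and flags low-confidence predictions. The mapping (`'1'` = positive, `'0'` = negative) is an assumption inferred from the example sentences and should be checked against the model card. The 0.6 cut-off is likewise a hypothetical value.

```python
# Predictions copied from the output above (scores rounded).
res = [
    {"label": "1", "score": 0.9897},
    {"label": "0", "score": 0.9969},
    {"label": "1", "score": 0.9640},
    {"label": "0", "score": 0.5669},
]

# Assumption: "1" = positive and "0" = negative, inferred from the examples above.
LABEL_NAMES = {"0": "negative", "1": "positive"}
AMBIGUOUS_BELOW = 0.6  # hypothetical cut-off for low-confidence predictions

def describe(pred):
    """Return a readable label, marking low-confidence predictions."""
    name = LABEL_NAMES.get(pred["label"], pred["label"])
    if pred["score"] < AMBIGUOUS_BELOW:
        return f"{name} (ambiguous)"
    return name

readable = [describe(p) for p in res]
print(readable)
```

The last sentence mixes praise and a complaint, which is consistent with its score staying close to 0.5.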

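### Zero-shot classification
- Zero-shot classification is a task in which a model classifies inputs without any training specific to the target labels.
- Provide the candidate class labels together with the input text, and the model performs the classification.
- To find suitable models, filter the Hub's model search by the tasks whose names start with `Zero-Shot`.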

In [16]:
model = "facebook/bart-large-mnli"
classifier = pipeline("zero-shot-classification", model=model)
text = [
    "Python is a programming language.", 
    "I love soccer",
    "The stock price rose slightly."
]
label = ["education", "IT", "sport", "business", "movie"]
res = classifier(text, candidate_labels=label)
pprint(res)

[{'labels': ['IT', 'business', 'sport', 'movie', 'education'],
  'scores': [0.29554063081741333,
             0.240149587392807,
             0.16470593214035034,
             0.16107183694839478,
             0.13853201270103455],
  'sequence': 'Python is a programming language.'},
 {'labels': ['sport', 'IT', 'business', 'movie', 'education'],
  'scores': [0.9926486611366272,
             0.00432586669921875,
             0.0011035494972020388,
             0.0011012055911123753,
             0.0008206696365959942],
  'sequence': 'I love soccer'},
 {'labels': ['business', 'IT', 'sport', 'movie', 'education'],
  'scores': [0.7607609629631042,
             0.11387484520673752,
             0.048093654215335846,
             0.04662804305553436,
             0.030642475932836533],
  'sequence': 'The stock price rose slightly.'}]
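
In the output above, each result lists its labels in descending order of score, so the first entry is the predicted class. The sketch below picks that top label and rejects it when the score is low. The 0.5 threshold and the `top_label` helper are hypothetical.

```python
# Results copied from the output above, with scores rounded to four decimals.
res = [
    {"sequence": "Python is a programming language.",
     "labels": ["IT", "business", "sport", "movie", "education"],
     "scores": [0.2955, 0.2401, 0.1647, 0.1611, 0.1385]},
    {"sequence": "I love soccer",
     "labels": ["sport", "IT", "business", "movie", "education"],
     "scores": [0.9926, 0.0043, 0.0011, 0.0011, 0.0008]},
    {"sequence": "The stock price rose slightly.",
     "labels": ["business", "IT", "sport", "movie", "education"],
     "scores": [0.7608, 0.1139, 0.0481, 0.0466, 0.0306]},
]

MIN_SCORE = 0.5  # hypothetical confidence threshold

def top_label(result, min_score=MIN_SCORE):
    """Return the best label, or None when the model is not confident enough."""
    label, score = result["labels"][0], result["scores"][0]
    return label if score >= min_score else None

summary = {r["sequence"]: top_label(r) for r in res}
print(summary)
```

The first sentence gets `None` because its best score is only about 0.30, a sign that none of the candidate labels fits it well.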


### Text generation

In [20]:
generator = pipeline(task="text-generation", model="gpt2")
# Model name format: owner/model-name. Names without an owner are models hosted directly by Hugging Face.
generator(["Today weather", "Python is"])

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[[{'generated_text': "Today weather \xa0is a good thing for you... the humidity levels in the area are below 50%. \xa0In many ways, it makes you happy to actually get out and try some hot air. \xa0It's easy to understand this,"}],
 [{'generated_text': "Python is an open-source browser. Here's why.\n\nIt doesn't have any specific browser, and we don't have a special WebEngine that compiles it from source. Instead we can easily use the WebEngine.js library,"}]]
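
The generated text includes the prompt, and a list of prompts yields one inner list per prompt. The sketch below strips the prompt to keep only the continuation. The `continuation` helper is a hypothetical name, and the sample strings are truncated copies of the output above. The text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect without manual slicing.

```python
prompts = ["Today weather", "Python is"]

# Truncated copies of the generated text shown above; one inner list per prompt.
res = [
    [{"generated_text": "Today weather \xa0is a good thing for you..."}],
    [{"generated_text": "Python is an open-source browser. Here's why."}],
]

def continuation(prompt, generated_text):
    """Return only the newly generated part, without the prompt."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()  # also strips the non-breaking space (\xa0)

continuations = [
    [continuation(prompt, item["generated_text"]) for item in items]
    for prompt, items in zip(prompts, res)
]
print(continuations)
```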

### Fill mask

In [27]:
text = "I'm going to <mask> because <mask> am hurt."
model="distilroberta-base"
unmask = pipeline("fill-mask", model=model)
res = unmask(text, top_k=2)  # top_k: number of highest-probability candidates to return (default: 5)
pprint(res)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[[{'score': 0.27230772376060486,
   'sequence': "<s>I'm going to cry because<mask> am hurt.</s>",
   'token': 8930,
   'token_str': ' cry'},
  {'score': 0.059206705540418625,
   'sequence': "<s>I'm going to sleep because<mask> am hurt.</s>",
   'token': 3581,
   'token_str': ' sleep'}],
 [{'score': 0.9910110235214233,
   'sequence': "<s>I'm going to<mask> because I am hurt.</s>",
   'token': 38,
   'token_str': ' I'},
  {'score': 0.008194930851459503,
   'sequence': "<s>I'm going to<mask> because i am hurt.</s>",
   'token': 939,
   'token_str': ' i'}]]
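
With two masks, the result holds one candidate list per mask, in the order the masks appear. The sketch below fills each mask with its highest-scoring candidate. Combining the per-mask winners is a simple heuristic assumed here, since each mask is scored while the other one is still masked.

```python
text = "I'm going to <mask> because <mask> am hurt."

# Candidates copied from the output above (scores rounded); one inner list per mask.
res = [
    [{"score": 0.2723, "token_str": " cry"}, {"score": 0.0592, "token_str": " sleep"}],
    [{"score": 0.9910, "token_str": " I"}, {"score": 0.0082, "token_str": " i"}],
]

filled = text
for candidates in res:
    best = max(candidates, key=lambda c: c["score"])
    # Replace only the first remaining mask so the masks are filled left to right.
    filled = filled.replace("<mask>", best["token_str"].strip(), 1)

print(filled)
```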


In [33]:
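# English gloss: "Tonight it will be cloudy nationwide with [MASK] falling in most regions, and as temperatures drop, many places will gradually see [MASK]."
text = "오늘 밤은 전국이 흐린 가운데 대부분 지역에 [MASK]가 내리겠고, 기온이 내려가면서 점차 [MASK]이 오는 곳이 많겠습니다"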

model='beomi/kcbert-base'
unmask_kor = pipeline('fill-mask', model=model)
res = unmask_kor(text, top_k=2)
pprint(res)

Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[[{'score': 0.6340416669845581,
   'sequence': '[CLS] 오늘 밤은 전국이 흐린 가운데 대부분 지역에서 가 내리겠고, 기온이 내려가면서 점차 [MASK] 이 '
               '오는 곳이 많겠습니다 [SEP]',
   'token': 4072,
   'token_str': '##서'},
  {'score': 0.11311744153499603,
   'sequence': '[CLS] 오늘 밤은 전국이 흐린 가운데 대부분 지역에 비가 가 내리겠고, 기온이 내려가면서 점차 [MASK] '
               '이 오는 곳이 많겠습니다 [SEP]',
   'token': 28206,
   'token_str': '비가'}],
 [{'score': 0.10058414191007614,
   'sequence': '[CLS] 오늘 밤은 전국이 흐린 가운데 대부분 지역에 [MASK] 가 내리겠고, 기온이 내려가면서 점차 바람 '
               '이 오는 곳이 많겠습니다 [SEP]',
   'token': 10108,
   'token_str': '바람'},
  {'score': 0.049839746206998825,
   'sequence': '[CLS] 오늘 밤은 전국이 흐린 가운데 대부분 지역에 [MASK] 가 내리겠고, 기온이 내려가면서 점차 영향 '
               '이 오는 곳이 많겠습니다 [SEP]',
   'token': 10741,
   'token_str': '영향'}]]


### Named entity recognition
- task: token-classification
  - The task used for NER and POS tagging.

In [34]:
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

model = "dbmdz/bert-large-cased-finetuned-conll03-english"
ner = pipeline("token-classification", model=model)
res = ner(text)
pprint(res)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'end': 12,
  'entity': 'I-PER',
  'index': 4,
  'score': 0.99938285,
  'start': 11,
  'word': 'S'},
 {'end': 14,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.99815494,
  'start': 12,
  'word': '##yl'},
 {'end': 16,
  'entity': 'I-PER',
  'index': 6,
  'score': 0.99590707,
  'start': 14,
  'word': '##va'},
 {'end': 18,
  'entity': 'I-PER',
  'index': 7,
  'score': 0.99923277,
  'start': 16,
  'word': '##in'},
 {'end': 35,
  'entity': 'I-ORG',
  'index': 12,
  'score': 0.9738931,
  'start': 33,
  'word': 'Hu'},
 {'end': 40,
  'entity': 'I-ORG',
  'index': 13,
  'score': 0.976115,
  'start': 35,
  'word': '##gging'},
 {'end': 45,
  'entity': 'I-ORG',
  'index': 14,
  'score': 0.9887976,
  'start': 41,
  'word': 'Face'},
 {'end': 57,
  'entity': 'I-LOC',
  'index': 16,
  'score': 0.9932106,
  'start': 49,
  'word': 'Brooklyn'}]
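
The output above is token-level, so "Sylvain" and "Hugging Face" are split into word pieces. The sketch below merges neighbouring tokens of the same entity type using their character offsets. `group_entities` and its `max_gap` rule are a simplified, hypothetical stand-in for this idea, not the library's implementation. The pipeline itself offers an `aggregation_strategy` argument (for example `"simple"`) for this purpose.

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Token-level predictions copied from the output above (entity, start, end only).
tokens = [
    {"entity": "I-PER", "start": 11, "end": 12},
    {"entity": "I-PER", "start": 12, "end": 14},
    {"entity": "I-PER", "start": 14, "end": 16},
    {"entity": "I-PER", "start": 16, "end": 18},
    {"entity": "I-ORG", "start": 33, "end": 35},
    {"entity": "I-ORG", "start": 35, "end": 40},
    {"entity": "I-ORG", "start": 41, "end": 45},
    {"entity": "I-LOC", "start": 49, "end": 57},
]

def group_entities(tokens, text, max_gap=1):
    """Merge neighbouring tokens of the same entity type into one span."""
    groups = []
    for tok in tokens:
        kind = tok["entity"].split("-", 1)[-1]  # "I-PER" -> "PER"
        last = groups[-1] if groups else None
        if last and last["type"] == kind and tok["start"] - last["end"] <= max_gap:
            last["end"] = tok["end"]  # extend the current span
        else:
            groups.append({"type": kind, "start": tok["start"], "end": tok["end"]})
    # Slice the original text so the word pieces are rejoined exactly.
    return [(g["type"], text[g["start"]:g["end"]]) for g in groups]

entities = group_entities(tokens, text)
print(entities)
```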


### Question answering
- Given a document and a question, the model finds the answer in the document and returns it.

In [41]:
model = "distilbert-base-cased-distilled-squad"
question="Where do I work?"
context="My name is Sylvain and I work at Hugging Face in Brooklyn"

In [45]:
qa = pipeline('question-answering', model=model)
res = qa(
    question=question, # the question
    context=context    # the document to search for the answer
)

In [46]:
res

{'score': 0.43262603878974915, 'start': 49, 'end': 57, 'answer': 'Brooklyn'}
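
The `start` and `end` values in this result are character offsets into the context, so the answer can be recovered by slicing. The sketch below checks that and applies a confidence threshold. The 0.5 threshold is a hypothetical value, not something the pipeline enforces.

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Result copied from the output above.
res = {"score": 0.43262603878974915, "start": 49, "end": 57, "answer": "Brooklyn"}

# Slicing the context with the offsets reproduces the answer string.
span = context[res["start"]:res["end"]]
print(span)

MIN_SCORE = 0.5  # hypothetical confidence threshold
confident = res["score"] >= MIN_SCORE
print(confident)
```

A score of about 0.43 suggests the model is unsure here, which is plausible because "Hugging Face" is also a reasonable answer to the question.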

In [51]:
context = """Sri Lankans are voting for a new president in the first election since mass protests sparked by the country's worst-ever economic crisis unseated the leader in 2022.
Saturday's vote is widely regarded as a referendum on economic reforms meant to put the country on the road to recovery.
But many are still struggling to make ends meet because of tax hikes, and cuts to subsidies and welfare.
Multiple analysts predict that economic concerns will be front of mind for voters in what is shaping up to be a close race.
"The country's soaring inflation, skyrocketing cost-of-living and poverty have left the electorate desperate for solutions to stabilise prices and improve livelihoods," Soumya Bhowmick, an associate fellow at India-based think tank the Observer Research Foundation, told the BBC.
"With the country seeking to emerge from its economic collapse, this election serves as a crucial moment for shaping Sri Lanka’s recovery trajectory and restoring both domestic and international confidence in its governance."
President Ranil Wickremesinghe, who was charged with the monumental task of leading Sri Lanka out of its economic collapse, is seeking another term.
The 75-year-old was appointed by parliament a week after former president Gotabaya Rajapaksa was chased out of power.
Shortly after taking office, Wickremesinghe crushed what was left of the protest movement. He has also been accused of shielding the Rajapaksa family from prosecution and allowing them to regroup - allegations he has denied.
Another strong contender is leftist politician Anura Kumara Dissanayake, whose anti-corruption platform has seen him draw increasing public support.
More candidates are running in Saturday's election than any other in Sri Lanka's history. But of more than three dozen, four are dominating the limelight.
Other than Wickremesinghe and Dissanayake, there is also the leader of the opposition, Sajith Premadasa, and the 38-year-old nephew of the ousted president, Namal Rajapaksa.
Counting begins once polls close at 16:00 local time (10:30 GMT), but results are not expected to become clear until Sunday morning."""

# question="How many candidates are there in total?"
question="When will the results be known?" 
res = qa(question=question, context=context)

In [52]:
print(res)

{'score': 0.8689843416213989, 'start': 2109, 'end': 2123, 'answer': 'Sunday morning'}


### Document summarization

In [53]:
model = "sshleifer/distilbart-cnn-12-6"
summarizer = pipeline("summarization", model=model)
res = summarizer(context)


In [55]:
print(res[0]["summary_text"])

 Sri Lankans are voting for a new president in the first election since the country's worst-ever economic crisis unseated the leader in 2022 . The vote is widely regarded as a referendum on economic reforms meant to put the country on the road to recovery . Many are still struggling to make ends meet because of tax hikes, and cuts to subsidies and welfare .
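
The summary above starts with a space and has a space before each period. The sketch below shows one way to tidy such text before display. The `tidy` helper and its regular expression are assumptions introduced here, not part of the pipeline.

```python
import re

# First two sentences of the summary printed above.
summary = (
    " Sri Lankans are voting for a new president in the first election since "
    "the country's worst-ever economic crisis unseated the leader in 2022 . "
    "The vote is widely regarded as a referendum on economic reforms meant to "
    "put the country on the road to recovery ."
)

def tidy(text):
    """Strip outer whitespace and remove spaces before punctuation marks."""
    return re.sub(r"\s+([.,!?])", r"\1", text.strip())

clean = tidy(summary)
print(clean)
```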


### Translation

In [None]:
model = "Helsinki-NLP/opus-mt-fr-en"
text = "Ce cours est produit par Hugging Face."

In [None]:
model = "Helsinki-NLP/opus-mt-ko-en"
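
Both model IDs above follow the pattern `Helsinki-NLP/opus-mt-{source}-{target}`. The sketch below builds such an ID and wraps the translation call. `opus_mt_model` and `translate` are hypothetical helpers, and not every language pair is guaranteed to have a published model. The call that downloads a model is left commented out.

```python
def opus_mt_model(src, tgt):
    """Build an OPUS-MT model ID from two language codes (hypothetical helper)."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"

def translate(texts, src, tgt):
    """Translate a list of texts; transformers is imported only when this is called."""
    from transformers import pipeline
    translator = pipeline("translation", model=opus_mt_model(src, tgt))
    # Each result is a dict with a "translation_text" key.
    return [item["translation_text"] for item in translator(texts)]

print(opus_mt_model("fr", "en"))
print(opus_mt_model("ko", "en"))
# translate(["Ce cours est produit par Hugging Face."], "fr", "en")  # downloads the model
```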

### Image captioning (generating text that describes an image)

In [None]:
url = "https://huggingface.co/datasets/Narsil/image_dummy/resolve/main/parrots.png"
url = "https://th.bing.com/th?id=ORMS.c526884bbea37c0bb9501f4f83b601e4&pid=Wdp&w=268&h=140&qlt=90&c=1&rs=1&dpr=1&p=0"
url = "http://images.cocodataset.org/val2017/000000039769.jpg"

In [None]:
model = "ydshieh/vit-gpt2-coco-en"

### Image classification

In [None]:
url = "https://pds.joongang.co.kr/news/component/htmlphoto_mmdata/202306/25/488f9638-800c-4bac-ad65-82877fbff79b.jpg"

In [None]:
model = "google/vit-base-patch16-224"

### Object Detection

In [None]:
image_path = r"data/image1.jpg"
image_path = r"data/image2.jpg"
image_path = r"data/image3.jpg"

model='facebook/detr-resnet-50'
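
An object-detection pipeline returns a list of dictionaries, each with a `score`, a `label`, and a `box` given as `xmin`/`ymin`/`xmax`/`ymax` pixel coordinates. The sketch below filters such results by confidence and computes each box area. The detections are hypothetical sample values, not output from the model above, and the 0.9 threshold is an arbitrary choice.

```python
# Hypothetical detections in the format returned by the object-detection pipeline.
detections = [
    {"score": 0.98, "label": "cat",
     "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 220}},
    {"score": 0.42, "label": "remote",
     "box": {"xmin": 40, "ymin": 70, "xmax": 90, "ymax": 100}},
]

def box_area(box):
    """Area of a bounding box in square pixels."""
    return (box["xmax"] - box["xmin"]) * (box["ymax"] - box["ymin"])

def keep_confident(detections, min_score=0.9):
    """Keep (label, area) pairs for detections at or above min_score."""
    return [(d["label"], box_area(d["box"])) for d in detections if d["score"] >= min_score]

kept = keep_confident(detections)
print(kept)
```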