# Package installation

In [1]:
%pip install transformers tokenizers datasets accelerate sentencepiece pillow timm -qU

Note: you may need to restart the kernel to use updated packages.


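# Using models with the Hugging Face Pipeline

- A Pipeline is the most basic object in the Transformers library. It automates the full **preprocessing -> inference -> postprocessing** sequence, so a model can be used with very little code.
- The library provides a Pipeline class for each task, and the `pipeline` function creates the appropriate one.
- You can **specify only the task** and use the default model and tokenizer, or **specify the model and tokenizer yourself**.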
- https://huggingface.co/docs/transformers/pipeline_tutorial

![huggingface_pipeline.png](figures/huggingface_pipeline.png)
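
The three stages can be pictured with a toy example. The sketch below is not the Transformers implementation: `ToySentimentPipeline`, its word lists, and its scoring rule are all hypothetical, and only the standard library is used. It only shows how preprocessing, inference, and postprocessing chain together behind a single call.

```python
import math


class ToySentimentPipeline:
    """Hypothetical stand-in for a real Pipeline: preprocess -> forward -> postprocess."""

    POSITIVE = {"happy", "good", "great"}
    NEGATIVE = {"unhappy", "bad", "sad"}

    def preprocess(self, text):
        # Stand-in for tokenization: split on whitespace, strip punctuation, lowercase.
        return [w.strip(".,!?").lower() for w in text.split()]

    def forward(self, tokens):
        # Stand-in for the model: return raw scores (logits) as [negative, positive].
        pos = sum(t in self.POSITIVE for t in tokens)
        neg = sum(t in self.NEGATIVE for t in tokens)
        return [float(neg), float(pos)]

    def postprocess(self, logits):
        # Softmax turns logits into probabilities; report the most likely label.
        exps = [math.exp(x) for x in logits]
        probs = [e / sum(exps) for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        return {"label": ["NEGATIVE", "POSITIVE"][best], "score": probs[best]}

    def __call__(self, text):
        return self.postprocess(self.forward(self.preprocess(text)))


toy_pipe = ToySentimentPipeline()
toy_result = toy_pipe("I am very happy.")
print(toy_result)
```

A real pipeline follows the same shape, with a trained tokenizer in the first stage and a neural network in the second.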

## Main supported tasks
- https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline.task
### Natural language processing tasks
- **text-classification**: text classification
- **text-generation**: text generation
- **translation**: translation
- **summarization**: summarization
- **question-answering**: question answering
- **fill-mask**: filling in masked tokens
- **token-classification**: classification of individual tokens, such as named entity recognition (NER) and part-of-speech (POS) tagging
- **feature-extraction**: feature extraction (context vectors)

### Image processing tasks
- **image-classification**: image classification
- **object-detection**
  -  Object detection
  -  Finds the location and class of each object in an image.
- **image-segmentation**
  -  Image segmentation
  -  Partitions an image at the pixel level and classifies which object each pixel belongs to.

## Searching for models
![huggingface_model_search.png](figures/huggingface_model_search.png)



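## The pipeline function
- Main parameters
  - **task:** The type of task to perform, given as a string.
  - **model:**
    - The name or path of the pretrained model to use.
    - A model name (ID) has the form `[owner]/[model name]`. Some older models can also be referenced without an owner, in which case the Hub resolves the short name to a namespaced ID. (e.g. "google/gemma-2-2b", "gpt2")
    - If no model is specified, **a default model for the task is loaded**.
  - **tokenizer:** The tokenizer to use for natural language tasks. If omitted, the tokenizer distributed with the model (the one with the same name as the model) is used.
  - **framework:** The deep learning framework to use. 'pt' selects PyTorch (the usual default) and 'tf' selects TensorFlow.
  - **device:** The device on which to run the pipeline model. Pass a string such as `"cpu"`, `"cuda:1"`, or `"mps"`, or a GPU index as an integer.
  - **revision:** A specific version of the model to use (a branch name, tag, or commit hash).
  - **trust_remote_code:** Whether to allow custom code stored in the model's Hub repository (for example, a custom model or tokenizer class) to be downloaded and executed locally. (bool) Enable it only for repositories you trust.
  - **use_fast:**
    - Whether to use a fast tokenizer. The default is True.
    - Fast tokenizers are implemented in `Rust` and are therefore faster, but they are not available for every model. If a model has no fast tokenizer, the regular (slow) tokenizer is used even when `use_fast=True`.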
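
The model ID format can be illustrated with a small helper. `split_model_id` is a hypothetical function written for this example and is not part of Transformers; it only separates the optional owner from the model name.

```python
def split_model_id(model_id):
    """Split a Hub model ID into (owner, name); owner is None for short IDs."""
    owner, sep, name = model_id.rpartition("/")
    return (owner if sep else None, name)


for model_id in ["google/gemma-2-2b", "gpt2"]:
    print(model_id, "->", split_model_id(model_id))
```

The helper assumes a Hub-style ID. A local directory path passed as `model` would need different handling.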

## Hands-on: pipelines by task

### Text classification

In [2]:
from transformers import pipeline

In [3]:
# Load a model and tokenizer and create a Pipeline.
# Model and tokenizer omitted -> the default model and tokenizer for the task are used.
pipe = pipeline(task="text-classification", framework="pt")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Device set to use cpu


In [None]:
result = pipe("I am very happy.")    # this result is overwritten by the next line
result = pipe("I am very unhappy.")  # raw text -> || tokenization -> inference -> postprocessing || -> final prediction

In [7]:
result

[{'label': 'NEGATIVE', 'score': 0.9997710585594177}]

In [8]:
data = [ 
    "The project was completed successfully.", 
    "She always brings positive energy to the team.", 
    "I am confident that we will achieve our goals.",
    "The results were not as expected.", 
    "He struggled to meet the deadline.", 
    "The client was dissatisfied with the final product." 
]
result_list = pipe(data)

In [10]:
result_list

[{'label': 'POSITIVE', 'score': 0.9998227953910828},
 {'label': 'POSITIVE', 'score': 0.9998812675476074},
 {'label': 'POSITIVE', 'score': 0.9998470544815063},
 {'label': 'NEGATIVE', 'score': 0.9978100657463074},
 {'label': 'NEGATIVE', 'score': 0.99960857629776},
 {'label': 'NEGATIVE', 'score': 0.9996129870414734}]

In [None]:
# Use a specific model.
model="distilbert-base-uncased-finetuned-sst-2-english" 
# Model ID registered on the Hugging Face Hub.
## Model ID format: owner ID/model ID. An ID without an owner is a short name that the Hub resolves to a namespaced ID.
pipe = pipeline(task="text-classification", 
                model=model,  # Model to use: a Hub model ID or a local path to saved model files.
                tokenizer=model # Tokenizer to use: a Hub tokenizer ID or a local path to saved tokenizer files.
                                # Can be omitted when the tokenizer ID is the same as the model ID.
                )

Device set to use cpu


In [12]:
pipe(data)

[{'label': 'POSITIVE', 'score': 0.9998227953910828},
 {'label': 'POSITIVE', 'score': 0.9998812675476074},
 {'label': 'POSITIVE', 'score': 0.9998470544815063},
 {'label': 'NEGATIVE', 'score': 0.9978100657463074},
 {'label': 'NEGATIVE', 'score': 0.99960857629776},
 {'label': 'NEGATIVE', 'score': 0.9996129870414734}]

In [13]:
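kor_texts = [
    "이 영화 정말 재미있어요!",  # "This movie is really fun!"
    "서비스가 별로였어요.",  # "The service was not good."
    "제품 품질이 우수합니다.",  # "The product quality is excellent."
    "따듯하고 부드럽고 제품은 너무 좋습니다. 그런데 배송이 너무 늦네요."  # "Warm, soft, a great product. But delivery is too slow." Ambiguous case: the score comes out around 0.56.
]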

In [14]:
pipe(kor_texts)

[{'label': 'POSITIVE', 'score': 0.9855567812919617},
 {'label': 'POSITIVE', 'score': 0.7425776124000549},
 {'label': 'POSITIVE', 'score': 0.6555716395378113},
 {'label': 'NEGATIVE', 'score': 0.5247918367385864}]

In [15]:
model = 'Copycats/koelectra-base-v3-generalized-sentiment-analysis' 
pipe = pipeline(task="text-classification", model=model)
result_list = pipe(kor_texts)

Device set to use cpu


In [None]:
result_list  # 1: positive, 0: negative

[{'label': '1', 'score': 0.9897311329841614},
 {'label': '0', 'score': 0.9969298243522644},
 {'label': '1', 'score': 0.9640172123908997},
 {'label': '0', 'score': 0.5669127702713013}]
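
The Korean model returns the class indices `'1'` and `'0'` as labels. A minimal sketch of turning them into readable names and flagging low-confidence predictions, using the scores printed above. `LABEL_NAMES`, `readable`, and the 0.6 threshold are choices made for this example, not part of the model.

```python
# Output of the Korean sentiment pipeline shown above.
koelectra_results = [
    {'label': '1', 'score': 0.9897311329841614},
    {'label': '0', 'score': 0.9969298243522644},
    {'label': '1', 'score': 0.9640172123908997},
    {'label': '0', 'score': 0.5669127702713013},
]

LABEL_NAMES = {'1': 'positive', '0': 'negative'}  # 1: positive, 0: negative


def readable(result, threshold=0.6):
    """Map a class index to a name and mark predictions below the threshold."""
    name = LABEL_NAMES[result['label']]
    uncertain = result['score'] < threshold
    return f"{name} ({result['score']:.2f})" + (" [uncertain]" if uncertain else "")


readable_results = [readable(r) for r in koelectra_results]
for line in readable_results:
    print(line)
```

Only the last review, which praises the product but complains about delivery, falls below the threshold.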

In [19]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="tabularisai/multilingual-sentiment-analysis")

Device set to use cpu


In [20]:
pipe("오늘은 날씨가 너무 좋다.")  # "The weather is so nice today."

[{'label': 'Very Positive', 'score': 0.665915846824646}]

In [None]:
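kor_texts = [
    "이 영화 정말 재미있어요!",  # "This movie is really fun!"
    "서비스가 별로였어요.",  # "The service was not good."
    "제품 품질이 우수합니다.",  # "The product quality is excellent."
    "따듯하고 부드럽고 제품은 너무 좋습니다. 그런데 배송이 너무 늦네요."  # "Warm, soft, a great product. But delivery is too slow." Ambiguous case: the score comes out around 0.56.
]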

In [21]:
pipe(kor_texts)

[{'label': 'Very Positive', 'score': 0.5016762614250183},
 {'label': 'Negative', 'score': 0.5245441794395447},
 {'label': 'Very Positive', 'score': 0.6271345615386963},
 {'label': 'Very Positive', 'score': 0.5461609959602356}]

In [22]:
result_list = pipe(data)

In [24]:
for txt, r in zip(data, result_list):
    print(txt, r)

The project was completed successfully. {'label': 'Neutral', 'score': 0.49546512961387634}
She always brings positive energy to the team. {'label': 'Positive', 'score': 0.6366152167320251}
I am confident that we will achieve our goals. {'label': 'Neutral', 'score': 0.45890164375305176}
The results were not as expected. {'label': 'Negative', 'score': 0.6052263975143433}
He struggled to meet the deadline. {'label': 'Negative', 'score': 0.518038272857666}
The client was dissatisfied with the final product. {'label': 'Negative', 'score': 0.6167647838592529}


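### Zero-shot classification
- Zero-shot means performing a task without any task-specific training: the model classifies text into labels it was never explicitly trained on.
- Given an input text together with a set of candidate class labels, the model classifies the text among those labels.
- To find suitable models on the Hub, filter by a task whose name starts with `Zero-Shot`.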

In [25]:
model = "facebook/bart-large-mnli"

text = ["Python is a programming language.", 
        "I love soccer", 
        "The stock price rose slightly today."]

labels = ["IT", "Sports"]

In [26]:
pipe = pipeline(task="zero-shot-classification", model=model)
result = pipe(text, candidate_labels=labels)

Device set to use cpu


In [27]:
result

[{'sequence': 'Python is a programming language.',
  'labels': ['IT', 'Sports'],
  'scores': [0.5758535265922546, 0.4241464138031006]},
 {'sequence': 'I love soccer',
  'labels': ['Sports', 'IT'],
  'scores': [0.9935312867164612, 0.006468690931797028]},
 {'sequence': 'The stock price rose slightly today.',
  'labels': ['IT', 'Sports'],
  'scores': [0.6849520802497864, 0.3150479197502136]}]

In [28]:
labels = ["business", "programming", "sports", "movie", "education"]
result = pipe(text, candidate_labels=labels)
result

[{'sequence': 'Python is a programming language.',
  'labels': ['programming', 'business', 'movie', 'sports', 'education'],
  'scores': [0.9856367111206055,
   0.005072721280157566,
   0.0034023483749479055,
   0.002961924998089671,
   0.0029262355528771877]},
 {'sequence': 'I love soccer',
  'labels': ['sports', 'programming', 'business', 'movie', 'education'],
  'scores': [0.9952405691146851,
   0.0012840895215049386,
   0.0012676474871113896,
   0.0012649551499634981,
   0.0009427034528926015]},
 {'sequence': 'The stock price rose slightly today.',
  'labels': ['business', 'movie', 'programming', 'sports', 'education'],
  'scores': [0.7462778091430664,
   0.06974831968545914,
   0.06889291107654572,
   0.0645080953836441,
   0.05057287961244583]}]
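
Each result pairs the candidate labels with their scores, listed from highest to lowest in the output above. A small sketch that extracts the predicted class per sentence, using shortened copies of those results (only the top two labels are kept for brevity). `top_label` is a hypothetical helper.

```python
zero_shot_results = [
    {'sequence': 'Python is a programming language.',
     'labels': ['programming', 'business'],
     'scores': [0.9856367111206055, 0.005072721280157566]},
    {'sequence': 'I love soccer',
     'labels': ['sports', 'programming'],
     'scores': [0.9952405691146851, 0.0012840895215049386]},
    {'sequence': 'The stock price rose slightly today.',
     'labels': ['business', 'movie'],
     'scores': [0.7462778091430664, 0.06974831968545914]},
]


def top_label(result):
    """Return (label, score) of the best candidate without relying on sort order."""
    score, label = max(zip(result['scores'], result['labels']))
    return label, score


predictions = {r['sequence']: top_label(r) for r in zero_shot_results}
for sequence, (label, score) in predictions.items():
    print(f"{sequence!r} -> {label} ({score:.3f})")
```

In the outputs above the scores of each sentence sum to 1 across the candidate labels. That is why the stock-price sentence was assigned to `IT` with 0.68 in the first run even though neither `IT` nor `Sports` fits: the candidate label set matters as much as the model.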

### Text generation

In [31]:
pipe = pipeline(task="text-generation")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [32]:
start_text = "Today weather"
sent = pipe(start_text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
sent

[{'generated_text': 'Today weather to be quite sunny and even the sun will be very warm. I would like to see how much snow there will be and what kind of snow people will be getting.\n\nThe sun is about to set so my kids are about to go to bed, so it is starting to get very cold and I am very worried.\n\nI am sure I will have a good time.\n\nI have been planning to write a video so please feel free to post this on Facebook too.'}]

In [34]:
print(sent[0]["generated_text"])

Today weather to be quite sunny and even the sun will be very warm. I would like to see how much snow there will be and what kind of snow people will be getting.

The sun is about to set so my kids are about to go to bed, so it is starting to get very cold and I am very worried.

I am sure I will have a good time.

I have been planning to write a video so please feel free to post this on Facebook too.


In [35]:
pipe(["I am", "Python is", "LLM is"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[[{'generated_text': 'I am not aware of anyone with the knowledge of this document, as I will not be able to provide it to others.\n\nI would like to offer my sincere sympathy to the people of Egypt, who are very distressed and sad."\n\nThe statement was posted on Twitter on Monday.'}],
 [{'generated_text': 'Python is a very good tool for doing a lot of things.\n\nPython is one of the most popular programming languages and many people think of Python as a nice programming language. After all, Python is the programming language that can be used to write code that you can put in a web browser and make use of.\n\nBut what is Python?\n\nPython is an open-source language, meaning that the developers are free to use any framework, library, or language. However, it is not an official language, meaning that developers are free to use any language.\n\nThis is because Python is not an official language and therefore not a programming language.\n\nIt is, however, a programming language that can b

In [36]:
pipe("나는 어제 ")  # Korean prompt: "Yesterday I "

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': '나는 어제 한 있어 형적는 나안 비어 이정이 여는 병버 해는 있어 형적지에 사하기는 너모도 던는 고어 형적는 있어 병버 해는 고어 형적는 있어 병버 해는 있어 형적는 있어 형적는 있어 병버 해는 고어 형적는 있어 병버 해는 '}]

In [2]:
from transformers import pipeline

model_id = 'Qwen/Qwen3-0.6B'
pipe = pipeline(task="text-generation", model=model_id)

Device set to use cpu


In [38]:
pipe("나는 어제")  # Korean prompt: "Yesterday I"

[{'generated_text': '나는 어제 1일에 5일 동안 팀을 위한 작업을 완료했으며, 이전에 진행한 2일 동안의 작업을 마감했으며, 그리고 2일 동안의 작업을 완료했으며, 3일 동안의 작업을 완료했으며, 4일 동안의 작업을 완료했으며, 5일 동안의 작업을 완료했으며. 2일 동안의 작업을 완료했으며, 3일 동안의 작업을 완료했으며, 4일 동안의 작업을 완료했으며, 5일 동안의 작업을 완료했으며. 3일 동안의 작업을 완료했으며, 4일 동안의 작업을 완료했으며, 5일 동안의 작업을 완료했으며. 2일 동안의 작업을 완료했으며, 3일 동안의 작업을 완료했으며, 4일 동안의 작업을 완료했으며, 5일 동안의 작업을 완료했으며. 1일 동안의'}]

In [12]:
msg = [
    {"role":"user", "content":"LLM에 대해서 설명해줘."}  # "Explain what an LLM is."
]
result = pipe(msg, max_new_tokens=1000)
# max_new_tokens: maximum number of new tokens to generate for the response.

In [None]:
result[0]["generated_text"][0]  # user input (query)

{'role': 'user', 'content': 'LLM에 대해서 설명해줘.'}

In [10]:
result[0]["generated_text"][1] # AI response

{'role': 'assistant',
 'content': "<think>\nOkay, the user asked for an explanation of LLMs. Let me start by recalling what LLMs are. They are Large Language Models, right? So, I need to define them clearly.\n\nFirst, I should mention that they are machine learning models trained on a large amount of text data. Then, their main purpose is to understand and generate human-like text. I should explain their capabilities, like understanding complex sentences, generating text, and answering questions. Also, it's important to note that LLMs are trained on vast datasets, which makes them powerful but not perfect.\n\nWait, should I mention their limitations? Maybe. But the user might just want a basic explanation. Let me check if I need to add any technical details. For example, how they are trained, maybe the training data source. Also, maybe their use cases. But since the user asked for an explanation, I should keep it straightforward.\n\nI should structure it with an introduction, definitio

In [13]:
print(result[0]["generated_text"][1]['content'])

<think>
Okay, the user wants an explanation of LLMs. Let me start by recalling what an LLM is. It stands for Large Language Model. First, I should define it clearly. Maybe mention that it's a type of AI model developed for various tasks like language understanding and generation.

I need to explain its components. Are they trained on huge datasets? Yes, so I should highlight that. Also, the training process involves a lot of data and computational resources. But wait, should I mention that they are trained on specific data, like text, and not just any data? That's important for clarity.

Then, the purpose of LLMs. They're used for tasks like language processing, text generation, and even other applications. Maybe give examples like writing stories, answering questions, or translating. It's useful for different industries, so that's a good point.

Wait, I should also touch on some key features. For example, they can understand and generate text, have different languages, and be trained 
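
The reply from this model begins with a reasoning section that opens with a `<think>` tag, followed by the actual answer. A minimal sketch for separating the two, assuming the reasoning is closed by a `</think>` tag (the outputs above are cut off before that point). `split_thinking` and the sample string are hypothetical.

```python
import re


def split_thinking(content):
    """Return (reasoning, answer); reasoning is '' when no complete <think> block exists."""
    match = re.match(r"\s*<think>(.*?)</think>\s*", content, flags=re.DOTALL)
    if match is None:
        return "", content.strip()
    return match.group(1).strip(), content[match.end():].strip()


# Hypothetical reply in the same shape as the model output.
sample = "<think>\nThe user wants a short definition.\n</think>\n\nAn LLM is a large language model."
reasoning, answer = split_thinking(sample)
print(answer)
```

With the pipeline result from this section, the input would be `result[0]["generated_text"][1]["content"]`.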

### Fill-mask

In [14]:
text = "I'm going to <mask> because <mask> am hurt."
model="distilroberta-base"

pipe = pipeline(task="fill-mask", model=model)
result = pipe(text, top_k=2) # Find the 2 most probable tokens (top_k=2) for each <mask>.

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


In [15]:
result

[[{'score': 0.26520147919654846,
   'token': 8930,
   'token_str': ' cry',
   'sequence': "<s>I'm going to cry because<mask> am hurt.</s>"},
  {'score': 0.06089096516370773,
   'token': 3581,
   'token_str': ' sleep',
   'sequence': "<s>I'm going to sleep because<mask> am hurt.</s>"}],
 [{'score': 0.9930052161216736,
   'token': 38,
   'token_str': ' I',
   'sequence': "<s>I'm going to<mask> because I am hurt.</s>"},
  {'score': 0.006336296442896128,
   'token': 939,
   'token_str': ' i',
   'sequence': "<s>I'm going to<mask> because i am hurt.</s>"}]]
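
When the input contains several mask tokens, the result is a list with one list of candidates per mask, as the output above shows. A sketch that picks the best candidate for each mask, using the values printed above (the `sequence` field is omitted for brevity). `best_tokens` is a hypothetical helper.

```python
fill_mask_result = [
    [{'score': 0.26520147919654846, 'token': 8930, 'token_str': ' cry'},
     {'score': 0.06089096516370773, 'token': 3581, 'token_str': ' sleep'}],
    [{'score': 0.9930052161216736, 'token': 38, 'token_str': ' I'},
     {'score': 0.006336296442896128, 'token': 939, 'token_str': ' i'}],
]


def best_tokens(per_mask_candidates):
    """Return the highest-scoring token string for each mask, without surrounding spaces."""
    return [max(cands, key=lambda c: c['score'])['token_str'].strip()
            for cands in per_mask_candidates]


filled = best_tokens(fill_mask_result)
template = "I'm going to {} because {} am hurt."
print(template.format(*filled))
```

Note that in the output above each candidate fills one mask while the other mask stays in place, so combining the top tokens like this is a convenience, not a joint prediction.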

In [18]:
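# Korean weather forecast: "Tonight, with the whole country cloudy, [MASK] will fall in most regions, and as temperatures drop, [MASK] will gradually fall in many places"
text = "오늘 밤은 전국이 흐린 가운데 대부분 지역에 [MASK]가 내리겠고, 기온이 내려가면서 점차 [MASK]이 오는 곳이 많겠습니다"
# pipe(text, top_k=2) # The RoBERTa model above was not trained on Korean.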

model='beomi/kcbert-base'
pipe = pipeline(task='fill-mask', model=model)

Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


In [19]:
pipe(text)

[[{'score': 0.6340416669845581,
   'token': 4072,
   'token_str': '##서',
   'sequence': '[CLS] 오늘 밤은 전국이 흐린 가운데 대부분 지역에서 가 내리겠고, 기온이 내려가면서 점차 [MASK] 이 오는 곳이 많겠습니다 [SEP]'},
  {'score': 0.11311744153499603,
   'token': 28206,
   'token_str': '비가',
   'sequence': '[CLS] 오늘 밤은 전국이 흐린 가운데 대부분 지역에 비가 가 내리겠고, 기온이 내려가면서 점차 [MASK] 이 오는 곳이 많겠습니다 [SEP]'},
  {'score': 0.03714243322610855,
   'token': 12,
   'token_str': ')',
   'sequence': '[CLS] 오늘 밤은 전국이 흐린 가운데 대부분 지역에 ) 가 내리겠고, 기온이 내려가면서 점차 [MASK] 이 오는 곳이 많겠습니다 [SEP]'},
  {'score': 0.035172488540410995,
   'token': 1664,
   'token_str': '비',
   'sequence': '[CLS] 오늘 밤은 전국이 흐린 가운데 대부분 지역에 비 가 내리겠고, 기온이 내려가면서 점차 [MASK] 이 오는 곳이 많겠습니다 [SEP]'},
  {'score': 0.0196221936494112,
   'token': 9666,
   'token_str': '##서는',
   'sequence': '[CLS] 오늘 밤은 전국이 흐린 가운데 대부분 지역에서는 가 내리겠고, 기온이 내려가면서 점차 [MASK] 이 오는 곳이 많겠습니다 [SEP]'}],
 [{'score': 0.10058414191007614,
   'token': 10108,
   'token_str': '바람',
   'sequence': '[CLS] 오늘 밤은 전국이 흐린 가운데 대부분 지역에 [MASK] 가 내리겠고,

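### Token classification
- task: token-classification 
  - Performs tasks such as named entity recognition (NER) and part-of-speech (POS) tagging.
  - Named entity recognition identifies specific named entities in a sentence (e.g. person names, place names, organization names).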

In [21]:
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."
model = "dbmdz/bert-large-cased-finetuned-conll03-english"

pipe = pipeline(task='token-classification', model=model)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [22]:
result = pipe(text)
result

[{'entity': 'I-PER',
  'score': 0.99938285,
  'index': 4,
  'word': 'S',
  'start': 11,
  'end': 12},
 {'entity': 'I-PER',
  'score': 0.99815494,
  'index': 5,
  'word': '##yl',
  'start': 12,
  'end': 14},
 {'entity': 'I-PER',
  'score': 0.99590707,
  'index': 6,
  'word': '##va',
  'start': 14,
  'end': 16},
 {'entity': 'I-PER',
  'score': 0.99923277,
  'index': 7,
  'word': '##in',
  'start': 16,
  'end': 18},
 {'entity': 'I-ORG',
  'score': 0.9738931,
  'index': 12,
  'word': 'Hu',
  'start': 33,
  'end': 35},
 {'entity': 'I-ORG',
  'score': 0.976115,
  'index': 13,
  'word': '##gging',
  'start': 35,
  'end': 40},
 {'entity': 'I-ORG',
  'score': 0.9887976,
  'index': 14,
  'word': 'Face',
  'start': 41,
  'end': 45},
 {'entity': 'I-LOC',
  'score': 0.9932106,
  'index': 16,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
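
The output above has one entry per subword token, so `Sylvain` is split into `S`, `##yl`, `##va`, `##in`. The sketch below merges consecutive tokens of the same entity type into one span by using the `start`/`end` character offsets. `merge_entities` is a hypothetical helper written for this example; the pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens for you.

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Token-level output from above, reduced to the fields used here.
tokens = [
    {'entity': 'I-PER', 'index': 4, 'start': 11, 'end': 12},
    {'entity': 'I-PER', 'index': 5, 'start': 12, 'end': 14},
    {'entity': 'I-PER', 'index': 6, 'start': 14, 'end': 16},
    {'entity': 'I-PER', 'index': 7, 'start': 16, 'end': 18},
    {'entity': 'I-ORG', 'index': 12, 'start': 33, 'end': 35},
    {'entity': 'I-ORG', 'index': 13, 'start': 35, 'end': 40},
    {'entity': 'I-ORG', 'index': 14, 'start': 41, 'end': 45},
    {'entity': 'I-LOC', 'index': 16, 'start': 49, 'end': 57},
]


def merge_entities(text, tokens):
    """Merge adjacent tokens (consecutive index, same type) into entity spans."""
    spans = []
    for tok in tokens:
        kind = tok['entity'].split('-')[-1]  # 'I-PER' -> 'PER'
        last = spans[-1] if spans else None
        if last and last['type'] == kind and tok['index'] == last['last_index'] + 1:
            # Extend the current span to cover this token.
            last['end'] = tok['end']
            last['last_index'] = tok['index']
        else:
            spans.append({'type': kind, 'start': tok['start'],
                          'end': tok['end'], 'last_index': tok['index']})
    return [(s['type'], text[s['start']:s['end']]) for s in spans]


entities = merge_entities(text, tokens)
print(entities)
```

Slicing the original text by offsets, rather than joining the `word` fields, avoids having to strip the `##` prefixes and restores the space in `Hugging Face`.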

### Question answering
- Given a document and a question, the model finds the answer within the document and returns it.

In [None]:
model = "distilbert-base-cased-distilled-squad"
pipe = pipeline(task="question-answering", model=model)

Device set to use cpu


In [26]:
question="Where do I work?"  # overwritten by the next line
question="Where is Hugging Face?"
context="My name is Sylvain and I work at Hugging Face in Brooklyn"

result = pipe(question=question,  # the question
              context=context)    # the document in which to look for the answer

In [27]:
result

{'score': 0.9893267154693604, 'start': 49, 'end': 57, 'answer': 'Brooklyn'}
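
The `start` and `end` values are character offsets into `context`, so the answer can be recovered by slicing. A small check using the result above; `show_in_context` is a hypothetical helper that marks the answer inside the original text.

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
qa_result = {'score': 0.9893267154693604, 'start': 49, 'end': 57, 'answer': 'Brooklyn'}


def show_in_context(context, result):
    """Surround the answer span with [[ ]] so its position in the context is visible."""
    s, e = result['start'], result['end']
    return context[:s] + "[[" + context[s:e] + "]]" + context[e:]


span = context[qa_result['start']:qa_result['end']]
marked = show_in_context(context, qa_result)
print(span)
print(marked)
```

Because the answer is always a span of the context, this kind of model cannot produce an answer that is not literally present in the document.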

In [30]:
# Korean news article (model input, kept in Korean) about Korean car exports to the US falling after US tariffs.
context = """우리나라 2대 수출 품목인 자동차가 도널드 트럼프 미국 행정부의 관세 여파로 지난달 큰 폭의 수출 감소율을 보이면서 우려가 커지고 있다. 현대차, 기아의 미국 수출 비중이 최대 85%에 이르는 상황에서 자동차 관세 장기화 시 피해는 걷잡을 수 없이 불어날 것이라는 암울한 전망이 나온다.
1일 산업통상자원부가 발표한 5월 수출입 동향에 따르면 지난달 자동차 수출은 작년 동기 대비 4.4% 감소한 62억달러로 집계됐다. 최대 자동차 시장인 미국으로의 수출은 18억4000만달러로 무려 32.0% 급감했다.
4월 미국의 수입산 자동차 25% 관세 부과에 이어 5월부터 일부 자동차 부품에도 25%의 관세가 적용된 결과다. 관세 장기화 시 피해는 더 커질 것이라는 우려가 현실화한 셈이다.
국내 완성차 1·2위 업체인 현대차·기아는 현지 생산 비중을 확대하는 동시에 가격 인상을 검토하고 있다. 관세 여파를 흡수하기 위해서다. 가격 인상이 현실화할 경우 미국 현지 판매는 줄어들 수밖에 없어 수출에는 더 악영향을 미칠 것으로 보인다.
"""

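q1 = "현대차 기아의 미국 수출비중은?"  # "What share of Hyundai and Kia exports goes to the US?"
q2 = "자동차 수출이 얼마나 급감했나?"  # "By how much did car exports plunge?"
q3 = "대미 수출 감소에 국내 자동차 업체들의 대응방법은?"  # "How are domestic automakers responding to the drop in exports to the US?"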

In [29]:
model = "ainize/klue-bert-base-mrc"
pipe = pipeline(task='question-answering', model=model)

Device set to use cpu


In [31]:
result = pipe(question=[q1, q2, q3], context=context)

In [32]:
result

[{'score': 0.6396173238754272, 'start': 99, 'end': 102, 'answer': '85%'},
 {'score': 0.632358193397522, 'start': 271, 'end': 276, 'answer': '32.0%'},
 {'score': 0.013693439774215221, 'start': 427, 'end': 433, 'answer': '가격 인상을'}]

### Document summarization

In [33]:
from transformers import pipeline 

model = "eenzeenee/t5-base-korean-summarization"
pipe = pipeline(task="summarization", model=model)

Device set to use cpu


In [34]:
result = pipe(context)

Token indices sequence length is longer than the specified maximum sequence length for this model (368 > 128). Running this sequence through the model will result in indexing errors


In [35]:
result

[{'summary_text': '자동차가 트럼프 미국 행정부의 관세 여파로 큰 폭의 수출 감소율을 보이면서 자동차 관세 장기화 시 피해는 걷잡을 수 없이 불어날 것이라는 암울한 전망이 나온다.'}]

### Translation

In [36]:
model = "Helsinki-NLP/opus-mt-fr-en"
text = "Ce cours est produit par Hugging Face."

pipe = pipeline(task='translation', model=model)

Device set to use cpu


In [37]:
result = pipe(text)
result

[{'translation_text': 'This course is produced by Hugging Face.'}]

In [38]:
model = "Helsinki-NLP/opus-mt-ko-en"
pipe = pipeline(task='translation', model=model)

Device set to use cpu


In [39]:
res = pipe(["이 문장을 영어로 번역합니다.", "날씨가 점점 더워집니다.", "오늘 비가 올 것 같습니다."])

In [40]:
res

[{'translation_text': 'I translate this sentence into English.'},
 {'translation_text': 'The weather gets warmer and warmer.'},
 {'translation_text': "It's going to rain today."}]

### Generating text that describes an image

In [41]:
url1 = "https://huggingface.co/datasets/Narsil/image_dummy/resolve/main/parrots.png"
url2 = "https://th.bing.com/th?id=ORMS.c526884bbea37c0bb9501f4f83b601e4&pid=Wdp&w=268&h=140&qlt=90&c=1&rs=1&dpr=1&p=0"
url3 = "http://images.cocodataset.org/val2017/000000039769.jpg"

In [42]:
model = "ydshieh/vit-gpt2-coco-en"
pipe = pipeline(task="image-to-text", model=model)
result = pipe(url1)
result

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Device set to use cpu
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.53.0. You should pass an instance of `Cache` instead, e.g. `past_key_values=DynamicCache.from_legacy_cache(past_key_values)`.


[{'generated_text': 'two birds are standing next to each other '}]

In [43]:
result = pipe([url1, url2, url3])  # 1. Image URLs (pass several inputs together as a list)

In [44]:
result

[[{'generated_text': 'two birds are standing next to each other '}],
 [{'generated_text': 'a baseball player is throwing a ball '}],
 [{'generated_text': 'a cat laying on a blanket next to a cat laying on a bed '}]]
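For a batch of inputs, the pipeline returns one inner list per image, each holding a dictionary with a `generated_text` key. The following is a minimal sketch of how that nested result might be flattened into plain caption strings. It uses a literal copy of the output above rather than calling the model, and `extract_captions` is a hypothetical helper name, not part of the Transformers API.

```python
# Literal copy of the batch output shown above (no model call is made here).
batch_result = [
    [{'generated_text': 'two birds are standing next to each other '}],
    [{'generated_text': 'a baseball player is throwing a ball '}],
    [{'generated_text': 'a cat laying on a blanket next to a cat laying on a bed '}],
]


def extract_captions(batch):
    """Return the first caption of each image with surrounding whitespace removed.

    Hypothetical helper: assumes every inner list holds at least one candidate.
    """
    return [candidates[0]['generated_text'].strip() for candidates in batch]


captions = extract_captions(batch_result)
for index, caption in enumerate(captions, start=1):
    print(f"image {index}: {caption}")
```

Stripping the text is useful because the generated captions end with a trailing space.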

In [45]:
result = pipe(["data/image1.jpg", "data/image2.jpg"])  # 2. Local image file paths

In [46]:
result

[[{'generated_text': 'a man walking a dog down a path with a dog '}],
 [{'generated_text': 'a cat and a dog are standing on a wooden bench '}]]

### Image Classification

In [47]:
url = "https://pds.joongang.co.kr/news/component/htmlphoto_mmdata/202306/25/488f9638-800c-4bac-ad65-82877fbff79b.jpg"

In [48]:
model = "google/vit-base-patch16-224"
pipe = pipeline(task='image-classification', model=model)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.
Device set to use cpu


In [None]:
# Inspect the classes the model can predict
pipe.model.config.id2label  # A 1000-class classifier (trained on the ImageNet dataset)

{0: 'tench, Tinca tinca',
 1: 'goldfish, Carassius auratus',
 2: 'great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias',
 3: 'tiger shark, Galeocerdo cuvieri',
 4: 'hammerhead, hammerhead shark',
 5: 'electric ray, crampfish, numbfish, torpedo',
 6: 'stingray',
 7: 'cock',
 8: 'hen',
 9: 'ostrich, Struthio camelus',
 10: 'brambling, Fringilla montifringilla',
 11: 'goldfinch, Carduelis carduelis',
 12: 'house finch, linnet, Carpodacus mexicanus',
 13: 'junco, snowbird',
 14: 'indigo bunting, indigo finch, indigo bird, Passerina cyanea',
 15: 'robin, American robin, Turdus migratorius',
 16: 'bulbul',
 17: 'jay',
 18: 'magpie',
 19: 'chickadee',
 20: 'water ouzel, dipper',
 21: 'kite',
 22: 'bald eagle, American eagle, Haliaeetus leucocephalus',
 23: 'vulture',
 24: 'great grey owl, great gray owl, Strix nebulosa',
 25: 'European fire salamander, Salamandra salamandra',
 26: 'common newt, Triturus vulgaris',
 27: 'eft',
 28: 'spotted salamander, Ambystoma 
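Because `id2label` is an ordinary dictionary that maps a class index to a comma-separated list of names, it can be searched with plain Python. The sketch below works on a small excerpt copied from the output above instead of the full mapping, and `find_class_ids` is a hypothetical helper introduced only for illustration.

```python
# Excerpt of the id2label mapping shown above (the real mapping has 1000 entries).
id2label_excerpt = {
    0: 'tench, Tinca tinca',
    1: 'goldfish, Carassius auratus',
    2: 'great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias',
    3: 'tiger shark, Galeocerdo cuvieri',
    4: 'hammerhead, hammerhead shark',
}


def find_class_ids(id2label, keyword):
    """Return the class indices whose label text contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [index for index, names in id2label.items() if keyword in names.lower()]


shark_ids = find_class_ids(id2label_excerpt, 'shark')
print(shark_ids)
```

With the real pipeline, the same helper could be called as `find_class_ids(pipe.model.config.id2label, 'shark')`.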

In [None]:
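# pipe(url)  # without top_k, the five highest-scoring classes are returned
pipe(url, top_k=3)  # return only the three highest-scoring classes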

[{'label': 'Egyptian cat', 'score': 0.8531318306922913},
 {'label': 'tabby, tabby cat', 'score': 0.047503840178251266},
 {'label': 'tiger cat', 'score': 0.03486618027091026}]

In [52]:
pipe('data/image3.jpg')

[{'label': 'desktop computer', 'score': 0.84475177526474},
 {'label': 'screen, CRT screen', 'score': 0.07074456661939621},
 {'label': 'monitor', 'score': 0.04988379031419754},
 {'label': 'hand-held computer, hand-held microcomputer',
  'score': 0.005030939821153879},
 {'label': 'computer keyboard, keypad', 'score': 0.00468211155384779}]
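The classification result is a list of `label`/`score` dictionaries, so picking the best class or discarding low-confidence classes needs no extra library. The sketch below reuses the output above as literal data. The `min_score` cut-off of 0.05 and the helper name `confident_labels` are assumptions chosen for illustration.

```python
# Literal copy of the classification output shown above.
predictions = [
    {'label': 'desktop computer', 'score': 0.84475177526474},
    {'label': 'screen, CRT screen', 'score': 0.07074456661939621},
    {'label': 'monitor', 'score': 0.04988379031419754},
    {'label': 'hand-held computer, hand-held microcomputer', 'score': 0.005030939821153879},
    {'label': 'computer keyboard, keypad', 'score': 0.00468211155384779},
]


def confident_labels(preds, min_score=0.05):
    """Keep only the predictions whose score reaches min_score (an assumed cut-off)."""
    return [p['label'] for p in preds if p['score'] >= min_score]


# The entry with the highest score is the model's top prediction.
best = max(predictions, key=lambda p: p['score'])

print(best['label'])
print(confident_labels(predictions))
```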

### Object Detection

In [53]:
# Other candidate inputs (uncomment one to switch images):
# image_path = "data/image1.jpg"
# image_path = "data/image2.jpg"
image_path = "data/image3.jpg"

model = "facebook/detr-resnet-50"

pipe = pipeline(task="object-detection", model=model)



Some weights of the model checkpoint at facebook/detr-resnet-50 were not used when initializing DetrForObjectDetection: ['model.backbone.conv_encoder.model.layer1.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing DetrForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DetrForObjectD

Device set to use cpu


In [None]:
res = pipe(image_path)

In [56]:
res

[{'score': 0.9956638216972351,
  'label': 'cell phone',
  'box': {'xmin': 96, 'ymin': 165, 'xmax': 136, 'ymax': 236}},
 {'score': 0.9919518232345581,
  'label': 'tv',
  'box': {'xmin': 147, 'ymin': 28, 'xmax': 429, 'ymax': 240}},
 {'score': 0.9975364208221436,
  'label': 'keyboard',
  'box': {'xmin': 108, 'ymin': 251, 'xmax': 358, 'ymax': 304}}]

In [57]:
res = pipe("data/image1.jpg")
res

[{'score': 0.9988904595375061,
  'label': 'dog',
  'box': {'xmin': 430, 'ymin': 423, 'xmax': 533, 'ymax': 597}},
 {'score': 0.9998466968536377,
  'label': 'person',
  'box': {'xmin': 531, 'ymin': 158, 'xmax': 673, 'ymax': 581}}]
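Each detection carries a `score`, a `label`, and a `box` with pixel coordinates (`xmin`, `ymin`, `xmax`, `ymax`). The sketch below shows one way the result might be post-processed: sorting by score, dropping weak detections, and deriving the box width, height, and area. It uses the output above as literal data. The 0.9 threshold and the helper names `box_size` and `summarize` are assumptions for illustration.

```python
# Literal copy of the detection output shown above.
detections = [
    {'score': 0.9988904595375061, 'label': 'dog',
     'box': {'xmin': 430, 'ymin': 423, 'xmax': 533, 'ymax': 597}},
    {'score': 0.9998466968536377, 'label': 'person',
     'box': {'xmin': 531, 'ymin': 158, 'xmax': 673, 'ymax': 581}},
]


def box_size(box):
    """Return (width, height) of a bounding box in pixels."""
    return box['xmax'] - box['xmin'], box['ymax'] - box['ymin']


def summarize(dets, min_score=0.9):
    """Sort detections by descending score and keep those at or above min_score."""
    rows = []
    for det in sorted(dets, key=lambda d: d['score'], reverse=True):
        if det['score'] < min_score:
            continue
        width, height = box_size(det['box'])
        rows.append({'label': det['label'], 'width': width,
                     'height': height, 'area': width * height})
    return rows


summary = summarize(detections)
for row in summary:
    print(row)
```

Sorting explicitly is a deliberate choice here, since the earlier detection output in this section is not listed in score order.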

In [None]:
pipe = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [59]:
model = pipe.model
tokenizer = pipe.tokenizer

In [60]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [61]:
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)
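The `pipe.model` and `pipe.tokenizer` objects cover the pre-processing and inference stages. What remains is post-processing, which turns the model's raw logits into a label and a score. The following is a rough sketch of that last stage for a single-label classifier, written in plain Python so it runs without loading the model. The logits are hypothetical values, not real model output, and `softmax` and `postprocess` are illustrative helpers rather than the library's implementation. The label mapping is the two-class SST-2 mapping used by this checkpoint.

```python
import math

# Two-class label mapping of the SST-2 sentiment checkpoint.
ID2LABEL = {0: 'NEGATIVE', 1: 'POSITIVE'}


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def postprocess(logits, id2label=ID2LABEL):
    """Pick the most probable class and report it in the pipeline's output shape."""
    probs = softmax(logits)
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return {'label': id2label[best_index], 'score': probs[best_index]}


# Hypothetical logits for illustration only; a real run would take them from the model.
hypothetical_logits = [2.0, -1.0]
prediction = postprocess(hypothetical_logits)
print(prediction)
```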

In [None]:
pipe("I am sad.")