# 1016

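## **Using converted models from the Hugging Face Hub**
The Hugging Face Hub can host Meta Llama weights that have already been converted to the Transformers format. In that case, you can download the converted model and use it right away.

#### Procedure:
1. **Log in to the Hugging Face Hub**:
   To download a gated model such as Meta Llama from the Hugging Face Hub, you need to log in with a Hugging Face account that has been granted access. The `huggingface_hub` library handles the login.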
   ```python
   from huggingface_hub import login
   login()  # prompts for your Hugging Face access token
   ```

2. **Load the model and tokenizer**:
   Download and use a Meta Llama model hosted on the Hugging Face Hub. For example, you can load the model named `meta-llama/Meta-Llama-3.1-8B-Instruct`:
   ```python
   from transformers import AutoModelForCausalLM, AutoTokenizer

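   # Keep the repo id in its own variable so both loaders receive the string, not the model object
   model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
   model = AutoModelForCausalLM.from_pretrained(model_id)
   tokenizer = AutoTokenizer.from_pretrained(model_id)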
   ```

3. **Set up the pipeline**:
   Create a text-generation pipeline to run the model:
   ```python
   from transformers import pipeline

   generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
   input_text = "Once upon a time,"
   outputs = generator(input_text, max_length=50, num_return_sequences=1)
   print(outputs)
   ```

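This approach is convenient because you can use a model hosted on the Hugging Face Hub directly, with no weight conversion step. However, you are limited to the converted models that have been uploaded to the Hub.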

---

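### Comparison of methods

| Method | Pros | Cons |
| --- | --- | --- |
| **Download directly from Meta and convert** | Gives access to the latest model weights | Requires a weight download request; the conversion process is complex |
| **Use from the Hugging Face Hub** | Ready to use without conversion; simple | Limited to models uploaded to the Hub; more restricted than Meta's own weight distribution |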



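## Fetching a gated Llama model hosted on the Hugging Face Hub
- Separately from downloading directly from Meta, if the model has been uploaded to the Hugging Face Hub, you can log in and download it with little effort.
- The model must be available on the Hugging Face Hub, and your account must have been granted access to it.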

In [1]:
!pip install huggingface_hub
!pip install transformers
!pip install accelerate



In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import transformers

In [3]:
from huggingface_hub import login

login()


In [None]:
model = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model)


In [None]:
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

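A rough way to see why `torch_dtype=torch.float16` matters: weight memory is approximately the parameter count times the bytes per parameter. When this cell ran, the checkpoint arrived as four safetensors shards of about 4.98, 5.00, 4.92, and 1.17 GB. The sketch below assumes those shards store 16-bit weights and that the sizes are decimal gigabytes; `BYTES_PER_PARAM` and `weight_memory_gb` are illustrative names, and activations and the KV cache are not counted.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Shard sizes reported while downloading the 8B checkpoint in this notebook.
shard_sizes_gb = [4.98, 5.00, 4.92, 1.17]
checkpoint_gb = sum(shard_sizes_gb)

# Assumption: the checkpoint stores 2 bytes per parameter.
implied_params = checkpoint_gb * 1e9 / BYTES_PER_PARAM["bfloat16"]

print(f"checkpoint size: {checkpoint_gb:.2f} GB")
print(f"implied parameter count: {implied_params / 1e9:.2f} billion")
print(f"same weights in float32: {weight_memory_gb(implied_params, 'float32'):.2f} GB")
```

Under these assumptions, loading in a 16-bit format needs roughly half the weight memory of float32, which is often the difference between fitting on a single GPU or not.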



In [None]:
sequence = pipeline(
    "I have tomatoes . basil and cheese at home. what can I cook for dinner?\n",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation = True,
    max_length=400,
)

for seq in sequence:
    print(f'result: {seq["generated_text"]}')



Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


result: I have tomatoes . basil and cheese at home. what can I cook for dinner?
You've got a great starting point for a delicious dinner! With tomatoes, basil, and cheese, here are some tasty ideas:

1. **Caprese Salad**: Slice the tomatoes, layer them with fresh basil leaves, and top with shredded mozzarella cheese. Drizzle with olive oil and balsamic vinegar for a simple yet flavorful salad.
2. **Tomato and Basil Pizza**: Use pre-made pizza dough or a crust, top it with tomato sauce, sliced tomatoes, fresh basil leaves, and shredded mozzarella cheese. Bake in the oven until the crust is golden brown and the cheese is melted.
3. **Grilled Cheese and Tomato Sandwich**: Butter two slices of bread, place sliced tomatoes and a sprinkle of basil between them, and top with shredded cheese. Grill until the bread is toasted and the cheese is melted.
4. **Basil and Tomato Bruschetta**: Toast some bread, top it with diced tomatoes, fresh basil leaves, and a sprinkle of mozzarella cheese. Drizzl

https://huggingface.co/meta-llama/Llama-3.2-1B

In [None]:
from huggingface_hub import login

login()


In [4]:
model = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
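The message above suggests storing the token as a secret instead of typing it each time. A minimal sketch of non-interactive login, assuming the token is exposed to the process as an environment variable named `HF_TOKEN` (how it gets there depends on your environment); `resolve_hf_token` and `login_from_env` are hypothetical helper names:

```python
import os


def resolve_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the token from an environment mapping, or None if unset or blank."""
    env = os.environ if env is None else env
    token = (env.get(var_name) or "").strip()
    return token or None


def login_from_env():
    """Log in with the stored token if present, else fall back to the interactive prompt."""
    # Imported lazily so resolve_hf_token has no third-party dependency.
    from huggingface_hub import login

    token = resolve_hf_token()
    if token:
        login(token=token)
    else:
        login()
```

Calling `login_from_env()` in place of `login()` avoids the interactive widget whenever the variable is set.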



In [6]:
from transformers import pipeline

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
input_text = "Once upon a time"
output_text = generator(input_text, max_length=50, num_return_sequences=1)
print(output_text)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': 'Once upon a time, there was a woman who had a problem with her teeth. She went to the dentist and had a cavity filled, but she was still left with a gap in her smile. She tried to cover it up with a bridge'}]
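Note that `max_length=50` in the cell above counts the prompt tokens as well as the generated ones, whereas the alternative `max_new_tokens` argument counts only the generated ones. A minimal sketch of the arithmetic, using a hypothetical helper name and made-up token counts:

```python
def new_token_budget(prompt_tokens, max_length):
    """Tokens left for generation when max_length caps prompt + output together."""
    return max(0, max_length - prompt_tokens)


# Illustrative numbers only: a 5-token prompt under max_length=50.
print(new_token_budget(prompt_tokens=5, max_length=50))

# A prompt longer than max_length leaves no room to generate.
print(new_token_budget(prompt_tokens=60, max_length=50))
```

With long prompts, `max_new_tokens` is usually the easier knob to reason about, because the budget does not shrink as the prompt grows.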


In [8]:
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

In [10]:
sequence = pipeline(
    "I have tomatoes . basil and cheese at home. what can I cook for dinner?\n",
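    do_sample=True,  # sample the next token instead of always taking the most likely one, for more varied responses
    top_k=10,  # restrict sampling to the 10 most likely tokens
    num_return_sequences=1,  # return one generated sequence (response)
    eos_token_id=tokenizer.eos_token_id,  # id of the end-of-sequence token; generation stops when the model emits it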
    truncation = True,
    max_length=50,
)

for seq in sequence:
    print(f'result: {seq["generated_text"]}')


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


result: I have tomatoes . basil and cheese at home. what can I cook for dinner?
I have tomatoes, basil and cheese at home. what can I cook for dinner?
You could make a tomato, basil and cheese pasta dish, or you could
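The `do_sample=True` and `top_k=10` settings used above can be illustrated without a model. The following is a simplified sketch of top-k sampling over a hand-written list of scores, not the library's implementation; `top_k_sample` and the example `logits` are made up for illustration.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring entries of `logits`."""
    # Keep only the indices of the k largest scores.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept scores (shifted by the max for numerical stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # one score per vocabulary entry
rng = random.Random(0)

# With k=2, only token ids 0 and 2 (the two highest scores) can be drawn.
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(200)]
print(sorted(set(samples)))

# With k=1, sampling reduces to always picking the highest score.
print(top_k_sample(logits, k=1, rng=rng))
```

A larger `k` admits more low-probability tokens and so more variety; `k=1` is equivalent to greedy decoding.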


Run the models, evaluate their performance, and choose the one that fits your purpose.

# 1017

In [12]:
sequence = pipeline(
    "Once upon a time\n",
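    do_sample=True,  # sample the next token instead of always taking the most likely one, for more varied responses
    top_k=10,  # restrict sampling to the 10 most likely tokens
    num_return_sequences=1,  # return one generated sequence (response)
    eos_token_id=tokenizer.eos_token_id,  # id of the end-of-sequence token; generation stops when the model emits it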
    truncation = True,
    max_length=400,
)

for seq in sequence:
    print(f'result: {seq["generated_text"]}')


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


result: Once upon a time
Once upon a time, I was a young woman, and I was not happy. I was unhappy because I had no idea what I wanted to be when I grew up. I had no idea how to find the answer to that question, so I did what I always did: I looked at the people around me and tried to figure out what they were doing. I asked them, “What do you do?” I would listen to their answers and then I would try to figure out what they were doing. I would try to figure out why they were doing what they were doing. I would try to figure out what I could do to help them. I would try to figure out how to help them. I would try to figure out how to make my life better.
I had a lot of people in my life that I could ask for advice. I had a lot of people that I could ask for advice. I had a lot of people that I could ask for advice. I had a lot of people that I could ask for advice. I had a lot of people that I could ask for advice. I had a lot of people that I could ask for advice. I had a lot of people

In [13]:
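sequence = pipeline(
    "옛날 옛적에 \n",  # Korean for "Once upon a time"
    do_sample=True,  # sample the next token instead of always taking the most likely one, for more varied responses
    top_k=10,  # restrict sampling to the 10 most likely tokens
    num_return_sequences=1,  # return one generated sequence (response)
    eos_token_id=tokenizer.eos_token_id,  # id of the end-of-sequence token; generation stops when the model emits it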
    truncation = True,
    max_length=400,
)

for seq in sequence:
    print(f'result: {seq["generated_text"]}')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


result: 옛날 옛적에 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느새 
이젠 
어느새 
오늘날에 
어느


In [16]:
sequence = pipeline(
    "The weather in korea",
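    do_sample=True,  # sample the next token instead of always taking the most likely one, for more varied responses
    top_k=10,  # restrict sampling to the 10 most likely tokens
    num_return_sequences=1,  # return one generated sequence (response)
    eos_token_id=tokenizer.eos_token_id,  # id of the end-of-sequence token; generation stops when the model emits it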
    truncation = True,
    max_length=400,
)

for seq in sequence:
    print(f'result: {seq["generated_text"]}')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


result: The weather in korea is very different from what we are used to. It is usually cold and rainy, and it can be very windy. The weather in korea is also very different from what we are used to. It is usually cold and rainy, and it can be very windy.
It is usually cold and rainy, and it can be very windy. The weather in korea is very different from what we are used to. It is usually cold and rainy, and it can be very windy.
It is usually cold and rainy, and it can be very windy. The weather in korea is very different from what we are used to. It is usually cold and rainy, and it can be very windy.
The weather in korea is very different from what we are used to. It is usually cold and rainy, and it can be very windy. The weather in korea is very different from what we are used to. It is usually cold and rainy, and it can be very windy.
The weather in korea is very different from what we are used to. It is usually cold and rainy, and it can be very windy. The weather in korea is very

The 1B model is suited to mobile and on-device use.
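Several of the 1B outputs above loop on the same phrases. One rough way to put a number on that when comparing models is the fraction of repeated n-grams in the generated text. This is a simple heuristic sketched for illustration, not a standard benchmark; `repeated_ngram_fraction` is a hypothetical helper and the sample strings are made up.

```python
def repeated_ngram_fraction(text, n=3):
    """Share of whitespace-token n-grams that repeat an earlier n-gram (0.0 = no repeats)."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


looping = "It is usually cold and rainy. " * 4
varied = "the quick brown fox jumps over the lazy dog"

print(f"looping text: {repeated_ngram_fraction(looping):.2f}")
print(f"varied text:  {repeated_ngram_fraction(varied):.2f}")
```

A value near zero means little repetition, while values well above one half point to the kind of looping seen in the outputs above.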

In [17]:
import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-3B-Instruct"
pipe = pipeline('text-generation', model=model_id, torch_dtype=torch.float16, device_map="auto")

message = [
    {"role":"system", "content":"You are a pirate chatbot who always speaks like a pirate."},
    {"role":"user", "content":"Hello, how are you?"}
]

outputs = pipe(message, max_length=256)

print(outputs[0]['generated_text'][-1])


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'role': 'assistant', 'content': "Yer lookin' fer a chat with ol' Blackbeak Betty, eh? Well, I be doin' swashbucklin' fine, thank ye fer askin'! Me circuits be hummin' like a chest overflowin' with golden doubloons, and me knowledge be sharper than a trusty cutlass! So, what be bringin' ye to these fair waters?"}
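The expression `outputs[0]['generated_text'][-1]` works because, for chat-style input, the pipeline returns the whole conversation with the new assistant message appended last. The sketch below uses a hand-written stand-in for that return value (not real model output) to show the shape; `last_assistant_reply` is a hypothetical helper.

```python
# Hand-written stand-in mirroring the structure printed above.
sample_outputs = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a pirate chatbot who always speaks like a pirate."},
            {"role": "user", "content": "Hello, how are you?"},
            {"role": "assistant", "content": "Arr, I be fine!"},
        ]
    }
]


def last_assistant_reply(outputs):
    """Return the text of the final message if the assistant wrote it, else None."""
    final_message = outputs[0]["generated_text"][-1]
    if final_message["role"] != "assistant":
        return None
    return final_message["content"]


print(last_assistant_reply(sample_outputs))
```

Checking the role before reading the content guards against printing the user's own message if generation produced nothing.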


In [18]:
import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-3B-Instruct"
pipe = pipeline('text-generation', model=model_id, torch_dtype=torch.float16, device_map="auto")

message = [
    {"role":"system", "content":"You are a baby chatbot who always speaks like a baby."},
    {"role":"user", "content":"Hello, how are you?"}
]

outputs = pipe(message, max_length=256)

print(outputs[0]['generated_text'][-1])


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'role': 'assistant', 'content': 'Oooh, hi! *giggle* Me wuv you! Me good! *coo* Wanna pway!'}
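In both chat examples only the system message changes, and the pipeline turns the role/content list into a single prompt string through the tokenizer's chat template. The sketch below hand-renders a simplified Llama 3 style layout for illustration only. The template actually shipped with a given tokenizer may insert additional text, so `tokenizer.apply_chat_template` remains the reliable way to build prompts; `render_llama3_prompt` is a hypothetical helper.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Simplified approximation of a Llama 3 style chat prompt (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header cues the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a baby chatbot who always speaks like a baby."},
    {"role": "user", "content": "Hello, how are you?"},
]
prompt = render_llama3_prompt(messages)
print(prompt)
```

Seen this way, the system prompt is simply the first turn of the text the model continues, which is why swapping it changes the persona of the reply.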


In [21]:
sequence = pipe(
    "I have tomatoes . basil and cheese at home. what can I cook for dinner?\n",
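    do_sample=True,  # sample the next token instead of always taking the most likely one, for more varied responses
    top_k=10,  # restrict sampling to the 10 most likely tokens
    num_return_sequences=1,  # return one generated sequence (response)
    eos_token_id=tokenizer.eos_token_id,  # end-of-sequence token id; note this reuses the `tokenizer` variable from an earlier cell, not `pipe.tokenizer`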
    truncation = True,
    max_length=450,
)

for seq in sequence:
    print(f'result: {seq["generated_text"]}')


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


result: I have tomatoes . basil and cheese at home. what can I cook for dinner?
The ingredients you have suggest a classic Italian dish. Here are a few ideas:

1.  **Caprese Salad**: A simple yet delicious salad made with sliced tomatoes, fresh basil leaves, and mozzarella cheese, dressed with olive oil and balsamic vinegar.
2.  **Bruschetta**: Toasted bread rubbed with garlic and topped with diced tomatoes, basil, and mozzarella cheese, drizzled with olive oil and balsamic glaze.
3.  **Tomato and Basil Pasta**: Cook pasta according to package directions, then toss with sautéed tomatoes, basil, and mozzarella cheese. Add some olive oil and balsamic vinegar for flavor.
4.  **Cheesy Tomato Panini**: A grilled sandwich filled with sliced tomatoes, basil, and mozzarella cheese, served with a side of marinara sauce or salad.
5.  **Caprese Stuffed Tomatoes**: Core and fill tomatoes with a mixture of mozzarella cheese, basil, and olive oil. Bake until the cheese is melted and the tomatoes are

Llama-3.2-11B-Vision


https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct

In [None]:
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # load weights in bfloat16 to reduce memory use and speed up computation on supported GPUs
    device_map="auto",  # place the model on the available device(s) automatically
)
processor = AutoProcessor.from_pretrained(model_id)  # preprocesses inputs into the form the model expects
# url = "https://www.ilankelman.org/stopsigns/australia.jpg"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)  # download the image from the URL and open it as a PIL.Image

# The prompt nudges the model toward a poetic description (a haiku) of the image
prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
inputs = processor(image, prompt, return_tensors="pt").to(model.device)  # convert the image and prompt to PyTorch ("pt") tensors

output = model.generate(**inputs, max_new_tokens=30)  # generate at most 30 new tokens conditioned on the image and prompt
print(processor.decode(output[0]))  # decode the generated token ids into readable text and print it

In [None]:
https://huggingface.co/beomi/Llama-3-Open-Ko-8B-Instruct-preview

beomi/Llama-3-Open-Ko-8B-Instruct-preview

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "beomi/Llama-3-Open-Ko-8B-Instruct-preview"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

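messages = [
    # System prompt (Korean): "You are a little-kid chatbot. Answer everything in Korean, the way a young child would."
    {"role": "system", "content": "너는 어린아이챗봇이야. 모든 대답은 어린 아이처럼 한국어(Korean)으로 대답해줘."},
    # User prompt (Korean): "What is the Fibonacci sequence? And could you write Python code for it?"
    {"role": "user", "content": "피보나치 수열이 뭐야? 그리고 피보나치 수열에 대해 파이썬 코드를 짜줘볼래?"},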
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=1,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
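The warning above says the attention mask cannot be inferred because the pad token and the end-of-sequence token share an id. The pure-Python sketch below shows why guessing the mask from "is this the pad id?" goes wrong in that situation, and what an explicit mask looks like. The token ids other than 128001 and both helper names are made up for illustration.

```python
EOS_ID = 128001
PAD_ID = EOS_ID  # same id reused for padding, as in the log above


def pad_with_mask(batch, pad_id):
    """Left-pad sequences to equal length and build the mask from the known lengths."""
    width = max(len(seq) for seq in batch)
    padded, mask = [], []
    for seq in batch:
        n_pad = width - len(seq)
        padded.append([pad_id] * n_pad + list(seq))
        mask.append([0] * n_pad + [1] * len(seq))
    return padded, mask


def guess_mask_from_ids(padded, pad_id):
    """Naive inference: treat every occurrence of pad_id as padding."""
    return [[int(token != pad_id) for token in row] for row in padded]


batch = [[10, 11, EOS_ID], [20]]  # the first sequence ends with a real EOS token
padded, explicit_mask = pad_with_mask(batch, PAD_ID)
guessed_mask = guess_mask_from_ids(padded, PAD_ID)

print(explicit_mask)  # the real EOS stays visible
print(guessed_mask)   # the real EOS is wrongly hidden
```

For the cell above, one option (assuming a recent `transformers` release) is to call `apply_chat_template` with `return_dict=True` so that it also returns an `attention_mask`, and to pass that mask to `generate` along with the input ids.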
