# HuggingFace with GPU
###### References
- [Hugging Face - Inference on One GPU](https://huggingface.co/docs/transformers/perf_infer_gpu_one)

#### Inference
Load a trained model and run data through it.

## BetterTransformer: PyTorch-native transformer fastpath

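An inference-only fastpath for Transformer models.

It accelerates the Transformer encoder, encoder layers, and multi-head attention, mainly in two ways:
1. Fused kernels: several operations are merged into a single kernel, which improves efficiency.
2. Exploiting sparsity in the inputs: unnecessary positions such as `pad_token` are skipped.

\* Requires PyTorch 1.13 or later.
`pip install accelerate optimum`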
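
To get a feel for the second point, the sketch below measures what share of a padded batch is padding, using only an attention mask. The mask values and the helper name `padding_fraction` are made up for illustration. How much of that work a fastpath actually skips depends on the implementation, so treat the number as an upper bound rather than a measured speedup.

```python
def padding_fraction(attention_mask):
    """Return the share of positions in a batch that are padding (mask == 0)."""
    total = sum(len(row) for row in attention_mask)
    padded = sum(row.count(0) for row in attention_mask)
    return padded / total if total else 0.0


# Illustrative mask: one short sequence padded to the length of a longer one.
mask = [
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
]
print(f"{padding_fraction(mask):.0%} of positions are padding")
```

The more the sequence lengths in a batch differ, the larger this share becomes.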


In [2]:
import torch

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
    print(f"{torch.cuda.get_device_name(torch.cuda.current_device())}")
print(f"Using {device} device.")

NVIDIA GeForce RTX 3060
Using cuda device.


## Load model directly

Load the tokenizer and model from the Hugging Face Hub.

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("beomi/kollama-7b")
model = AutoModelForCausalLM.from_pretrained("beomi/kollama-7b")

Downloading (…)okenizer_config.json:   0%|          | 0.00/231 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.69M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/93.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/507 [00:00<?, ?B/s]

bin C:\Users\jongg\PycharmProjects\HuggingFaceKoLLaMa13b\venv\Lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll


Downloading (…)fetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/15 [00:00<?, ?it/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/965M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/944M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/944M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/742M [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/426M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

In [4]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(52000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNo
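
The layer shapes in the printout are enough for a rough estimate of how much memory the weights need at different precisions. This is a back-of-the-envelope sketch, not a measurement. It assumes each `LlamaRMSNorm` holds a single weight vector of the hidden size, and it assumes an untied `lm_head` of shape 4096 x 52000, which is cut off in the output above. It ignores activations, the KV cache, and any quantization overhead.

```python
# Shapes read off the print(model) output.
VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 52000, 4096, 11008, 32

attention = 4 * HIDDEN * HIDDEN      # q_proj, k_proj, v_proj, o_proj (bias=False)
mlp = 3 * HIDDEN * INTERMEDIATE      # gate_proj, up_proj, down_proj (bias=False)
norms = 2 * HIDDEN                   # assumption: one weight vector per RMSNorm
per_layer = attention + mlp + norms

embed = VOCAB * HIDDEN
final_norm = HIDDEN
lm_head = HIDDEN * VOCAB             # assumption: untied head, not visible in the printout
total_params = embed + LAYERS * per_layer + final_norm + lm_head


def weight_gib(params, bits_per_param):
    """Size of the weights alone, in GiB, at the given precision."""
    return params * bits_per_param / 8 / 1024**3


print(f"parameters: {total_params:,}")
for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_gib(total_params, bits):5.1f} GiB")
```

Under these assumptions the fp16 weights alone come out slightly above 12 GiB, which is why a quantized load is attractive on a 12 GB card.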

In [9]:
model_input = tokenizer("First time KoLLaMa-7b", return_tensors = "pt", return_token_type_ids = False)

print(f"model_input: {model_input}")

model_input: {'input_ids': tensor([[26026,  5010,  1799,    82,  3202,    68,  3667,    16,    26,    69]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [12]:
output = model(**model_input)

print(f"Model Inference Output: {output}")

Model Inference Output: CausalLMOutputWithPast(loss=None, logits=tensor([[[-13.0493, -20.2895, -19.6934,  ..., -11.8931, -14.9382, -13.6093],
         [ -7.3461, -15.3812, -15.2476,  ...,  -9.9199, -11.7744, -11.4650],
         [ -1.1042, -10.3483,  -9.6340,  ...,  -3.4370,  -5.0807,  -5.9651],
         ...,
         [ -3.5642, -12.9529, -12.3812,  ...,  -3.8027,  -5.2115,  -5.4334],
         [ -4.8638, -13.2885, -13.7816,  ...,  -8.1212,  -8.5027,  -8.0704],
         [ -5.1985, -12.7747, -12.2863,  ...,  -8.7880,  -9.4489,  -8.3590]]],
       grad_fn=<UnsafeViewBackward0>), past_key_values=((tensor([[[[-2.0317e-01, -8.0170e-01,  7.0255e-01,  ...,  5.0435e-01,
            4.8865e-01, -7.5027e-01],
          [ 6.7895e-01, -5.4004e-01,  2.5843e-01,  ...,  6.6234e-01,
            8.8710e-01, -6.0496e-01],
          [ 1.2909e+00,  1.3081e-01, -1.3345e-01,  ...,  9.9597e-01,
            5.7350e-01, -9.5202e-01],
          ...,
          [ 6.3310e-01, -1.2988e-01,  3.2070e-01,  ..., -2.5237e
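
The `logits` field holds one score per vocabulary entry for every input position, shaped (batch, sequence, vocab). A minimal sketch of turning the last position into a greedy next-token choice follows. It uses plain Python lists and a toy 4-entry vocabulary instead of the real tensor, and the numbers are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_next_token(logits):
    """logits: nested list shaped (sequence, vocab) for one batch item.

    Returns (token_id, probability) of the most likely next token,
    which is read from the last sequence position.
    """
    probs = softmax(logits[-1])
    token_id = max(range(len(probs)), key=probs.__getitem__)
    return token_id, probs[token_id]


# Toy logits: 3 positions, vocabulary of 4 entries (illustrative values).
toy_logits = [
    [-13.0, -20.3, -19.7, -11.9],
    [-7.3, -15.4, -15.2, -9.9],
    [-5.2, -12.8, -12.3, -8.8],
]
token_id, prob = greedy_next_token(toy_logits)
print(f"next token id: {token_id}, probability: {prob:.3f}")
```

With the real output, the equivalent lookup would be `output.logits[0, -1].argmax()`.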

In [13]:
import gc

del tokenizer, model, model_input, output
gc.collect()



9903

In [15]:
# Using GPU

tokenizer = AutoTokenizer.from_pretrained("beomi/kollama-7b")
model_4bit = AutoModelForCausalLM.from_pretrained("beomi/kollama-7b", device_map = "auto", load_in_4bit = True)


In [18]:
model_input = tokenizer("Using KoLLaMa 7b with GPU", return_tensors = "pt", return_token_type_ids = False)
model_input_string = tokenizer("Using KoLLaMa 7b with GPU")
output = model_4bit(**model_input)

print(f"Model Input is \"{tokenizer.decode(model_input_string.input_ids)}\"")
print(f"Inference Output is \"{output}\"")

Model Input is "Using KoLLaMa 7b with GPU"
Inference Output is "CausalLMOutputWithPast(loss={'logits': tensor([[[-10.4141, -16.3750, -16.6094,  ..., -10.7734, -10.9688, -11.9531],
         [ -5.9922, -12.7109, -11.9609,  ...,  -7.7266,  -9.2734,  -5.9453],
         [ -0.0890,  -9.5703,  -8.3359,  ...,  -2.1699,  -4.6211,  -5.1719],
         ...,
         [ -4.8555, -12.7500, -13.0625,  ...,  -9.4766,  -8.9219,  -9.7266],
         [ -5.9648, -13.6250, -13.3750,  ...,  -8.2891,  -9.1641,  -7.9375],
         [ -5.7461, -16.6875, -16.5469,  ..., -10.4844, -12.4297, -11.0781]]],
       grad_fn=<ToCopyBackward0>), 'past_key_values': ((tensor([[[[ 4.4580e-01, -2.6147e-01,  9.0088e-01,  ...,  1.0381e+00,
            7.7686e-01, -4.9934e-03],
          [ 2.7759e-01, -3.3447e-01,  3.0347e-01,  ...,  1.8799e-01,
            1.5254e+00, -1.6875e+00],
          [ 1.2832e+00,  1.5649e-01, -1.2854e-01,  ...,  9.6338e-01,
            5.7910e-01, -9.2725e-01],
          ...,
          [ 7.2656e-01, -3.

In [19]:
del model_4bit, model_input, model_input_string, output, tokenizer

gc.collect()
torch.cuda.empty_cache()

print(f"Reserved GPU Memory: {torch.cuda.memory_reserved('cuda:0')/(1024**3):.2f}GB")

Reserved GPU Memory: 0.08GB


## KoLLaMa-13b GPU Load

## CPU and GPU Offload

Loading the KoLLaMa-13b model above raised the error below, so I followed [this guide](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu).

ValueError: 
                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
                        the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
                        these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
                        `device_map` to `from_pretrained`. Check
                        https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
                        for more details.

In [7]:
import gc, torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

device = "cuda:0"

quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload = True)

device_map = {
    "lm_head": "cpu",
    "model.embed_tokens": "cpu",
    "model.layers": "cpu",
    "model.norm": 0,
    "transformer.h": 0,
    "transformer.ln_f": 0,
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0
}

tokenizer = AutoTokenizer.from_pretrained("beomi/kollama-13b")
model_8bit = AutoModelForCausalLM.from_pretrained("beomi/kollama-13b",
                                                  device_map = device_map,
                                                  load_in_8bit = True,
                                                  quantization_config = quantization_config)

input_string = "한국어 명령어를 이해하는 오픈소스 언어모델"
model_input = tokenizer(input_string, return_tensors = "pt", return_token_type_ids = False)

print(f"Input String is \"{input_string}\"")
print(f"Model Input is \"{model_input}\"")

output = model_8bit(**model_input)
print(f"Output is \"{output}\"")

del device, device_map, tokenizer, model_8bit, input_string, model_input, output, quantization_config
gc.collect()
torch.cuda.empty_cache()

print(f"Allocated GPU Memory: {torch.cuda.memory_allocated('cuda:0')/(1024**3):.2f}GB")


Loading checkpoint shards:   0%|          | 0/28 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 12.00 GiB total capacity; 11.33 GiB already allocated; 0 bytes free; 11.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
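
The `device_map` in the cell above mixes LLaMA module names (`model.*`, `lm_head`) with `transformer.*` keys, which appear to come from the documentation example for a different architecture. A small helper can generate a map that only uses the names shown by `print(model)`. This is a sketch: the helper name `build_llama_device_map` and the choice of which layers go to the GPU are hypothetical, and it is not claimed to fix the OOM above.

```python
def build_llama_device_map(num_layers, gpu_layers, gpu=0, offload="cpu"):
    """Put the first `gpu_layers` decoder layers on `gpu`, everything else on `offload`.

    Keys follow the module names printed for LlamaForCausalLM:
    model.embed_tokens, model.layers.<i>, model.norm, lm_head.
    """
    if not 0 <= gpu_layers <= num_layers:
        raise ValueError("gpu_layers must be between 0 and num_layers")
    device_map = {"model.embed_tokens": offload}
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = gpu if i < gpu_layers else offload
    device_map["model.norm"] = offload
    device_map["lm_head"] = offload
    return device_map


# 32 layers matches the 7b printout; for another checkpoint, read the layer
# count from its config (model.config.num_hidden_layers) instead of guessing.
device_map = build_llama_device_map(num_layers=32, gpu_layers=11)
for name, device in list(device_map.items())[:4]:
    print(name, "->", device)
```

The resulting dict can be passed as `device_map=` to `from_pretrained`. Lowering `gpu_layers` trades speed for GPU memory.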

In [30]:
# Giving up on the approach above; trying the pipeline API instead

from transformers import pipeline


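"""
:arg device_map = "auto": let the library choose the device placement automatically
:arg framework = "pt": PyTorch ("tf" = TensorFlow); by default the installed framework is picked automatically, but it is set explicitly here just in case
:arg revision = "140k": the 140k branch of the model
"""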
pipe = pipeline("text-generation", model = "beomi/kollama-7b", device_map = "auto", framework = "pt", revision = "140k")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Output of the cell above:

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.

xFormers reportedly makes model training and inference faster and also reduces memory usage.


In [32]:
output = pipe("너 이름이 뭐니")
print(output)

[{'generated_text': '너 이름이 뭐니 부여 viaalysis 여대생+---------1270SLE974 청년도약계좌693pictureMu별로더럽고AHL693이혼'}]


In [34]:
# Try again with the main branch

from transformers import pipeline

pipe = pipeline("text-generation", model = "beomi/kollama-7b", device_map = "auto", framework = "pt")
output = pipe("너 이름이 뭐니")
print(output)

Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

[{'generated_text': '너 이름이 뭐니.`n",\n       "     '}]


## Model Inference with KoLLaMa-7b

#### Declare variables and define a memory cleanup function

In [77]:
import gc, torch

# Every variable used during inference is stored in inference_dict
inference_dict = {}

model_name = "beomi/kollama-7b"

def clear():
    # Drop all stored variables, then restore the default prompt
    inference_dict.clear()
    inference_dict["inference_input"] = "구글에 파이썬 검색"
    
    gc.collect()
    torch.cuda.empty_cache()
    print("\n======================== GPU Memory ======================== ")
    print(f"Allocated GPU Memory: {torch.cuda.memory_allocated('cuda:0')/(1024**3):.2f}GB")
    print(f"Reserved GPU Memory: {torch.cuda.memory_reserved('cuda:0')/(1024**3):.2f}GB")
    
clear()


Allocated GPU Memory: 0.01GB
Reserved GPU Memory: 0.02GB


#### Pipeline Inference

In [22]:
from transformers import pipeline

def _pipeInference():
    inference_dict["pipe"] = pipeline("text-generation",
                                      model = model_name,
                                      device_map = "auto",
                                      # model_kwargs = model_kwargs,
                                      framework = "pt"
                                      )
    
    inference_dict["result"] = inference_dict["pipe"](inference_dict["inference_input"])
    
    print(f"Inference Input to Pipeline is \"{inference_dict['inference_input']}\"")
    print(f"Inference with Pipeline Result: {inference_dict['result']}")

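def pipeInference():
    try:
        _pipeInference()
    except Exception as e:
        # Report the failure instead of silently swallowing it
        print(e)
    finally:
        # Always release the pipeline and GPU memory
        clear()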

pipeInference()

Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

Inference Input to Pipeline is "구글에 파이썬 검색"
Inference with Pipeline Result: [{'generated_text': '구글에 파이썬 검색을 구현합니다.\n\n        '}]

Allocated GPU Memory: 0.01GB
Reserved GPU Memory: 0.02GB


#### Model Load Inference 

In [79]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# unused
device_map = {
	"lm_head": "cpu",
	"model.embed_tokens": "cpu",
	"model.layers.0": 0,
	"model.layers.1": 0,
	"model.layers.2": 0,
	"model.layers.3": 0,
	"model.layers.4": 0,
	"model.layers.5": 0,
	"model.layers.6": 0,
	"model.layers.7": 0,
	"model.layers.8": 0,
	"model.layers.9": 0,
	"model.layers.10": 0,
	"model.layers.11": "cpu",
	"model.layers.12": "cpu",
	"model.layers.13": "cpu",
	"model.layers.14": "cpu",
	"model.layers.15": "cpu",
	"model.layers.16": "cpu",
	"model.layers.17": "cpu",
	"model.layers.18": "cpu",
	"model.layers.19": "cpu",
	"model.layers.20": "cpu",
	"model.layers.21": "cpu",
	"model.layers.22": "cpu",
	"model.layers.23": "cpu",
	"model.layers.24": "cpu",
	"model.layers.25": "cpu",
	"model.layers.26": "cpu",
	"model.layers.27": "cpu",
	"model.layers.28": "cpu",
	"model.layers.29": "cpu",
	"model.layers.30": "cpu",
	"model.layers.31": "cpu",
	"model.norm": "cpu",
	"transformer.h": 0,
	"transformer.ln_f": 0,
	"transformer.word_embeddings": 0,
	"transformer.word_embeddings_layernorm": 0
}

inputs = [
	"구글에 파이썬 검색",
	"파이썬으로 가위바위보 게임 개발"
]

inference_dict["inference_input"] = inputs

def _modelInference():
	print(f"Inference Input to Pipeline is \"{inference_dict['inference_input']}\"")
	
	inference_dict["tokenizer"] = AutoTokenizer.from_pretrained(model_name,
																padding_side = "left")
	
	inference_dict["inference_token"] = inference_dict["tokenizer"](inference_dict["inference_input"],
																	return_tensors = "pt",
																	return_token_type_ids = False,
																	padding = True)
	
	print(f"\nInference Token is \"{inference_dict['inference_token']}\"")
	
	inference_dict["model"] = AutoModelForCausalLM.from_pretrained(model_name,
																   device_map = "auto")
	
	print(f"\nModel\n===================================================\n{inference_dict['model']}")
	print(f"model.hf_device_map: {inference_dict['model'].hf_device_map}")
	
	# Commented out because the forward pass caused an OOM error
	# inference_dict["keys"] = inference_dict["model"](**inference_dict["inference_token"]).keys()
	# print(f"Key Dictionary of Inference Result is \"{inference_dict['keys']}\"")
	
	# See the parameters of LlamaForCausalLM.forward() (modeling_llama.py, line 760)
	inference_dict["result_tensor"] = inference_dict["model"].generate(**inference_dict["inference_token"])
	print(f"\nGeneration Result Tensor is \"{inference_dict['result_tensor']}\"")
	
	# From the example in LlamaForCausalLM.forward() (modeling_llama.py, line 795):
	# tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
	# inference_dict["result_with_sst"] = inference_dict["tokenizer"].batch_decode(inference_dict["result_tensor"],
	# 																			 skip_special_tokens = True, # default False
	# 																			 clean_up_tokenization_spaces = False)
	# print(f"\nGeneration Result with Skip_Special_Tokens is \"{inference_dict['result_with_sst']}\"")
	# 
	# inference_dict["result"] = inference_dict["tokenizer"].batch_decode(inference_dict["result_tensor"],
	# 																			 clean_up_tokenization_spaces = True)
	# print(f"\nGeneration Result is \"{inference_dict['result']}\"")
	
	inference_dict["result"] = inference_dict["tokenizer"].batch_decode(inference_dict["result_tensor"],
																		clean_up_tokenization_spaces = True,
																		skip_special_tokens = True)
	for q, a in zip(inputs, inference_dict["result"]):
		print(f"{q} -> {a}")


def modelInference():
	try:
		_modelInference()
	except Exception as e:
		print(e)
	finally:
		clear()


modelInference()


Inference Input to Pipeline is "['구글에 파이썬 검색', '파이썬으로 가위바위보 게임 개발']"

Inference Token is "{'input_ids': tensor([[    0,     0,     0,     0, 26660,   279, 16862,  6991,   109,  7087],
        [13903,  6991,   109,   378, 31143,  1004,   594,   472,  2233,  2409]]), 'attention_mask': tensor([[0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}"
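
The zeros on the left of the first row come from `padding_side = "left"`. For a decoder-only model, generation continues from the last position, so padding goes on the left to keep the real tokens next to the newly generated ones. The sketch below reproduces the batch above in plain Python. The helper name `left_pad` is hypothetical, and `pad_id=0` is taken from the zeros in the printed `input_ids`.

```python
def left_pad(sequences, pad_id=0):
    """Left-pad token id lists to a common length and build the attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        gap = width - len(seq)
        input_ids.append([pad_id] * gap + list(seq))
        attention_mask.append([0] * gap + [1] * len(seq))
    return input_ids, attention_mask


# Unpadded token ids, copied from the tokenizer output above.
batch = [
    [26660, 279, 16862, 6991, 109, 7087],
    [13903, 6991, 109, 378, 31143, 1004, 594, 472, 2233, 2409],
]
ids, mask = left_pad(batch)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

The attention mask tells the model to ignore the padded positions, so only the real tokens influence the result.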


Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]


Model
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(52000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Lla

## Results

The model does not answer my questions; it only makes up plausible-sounding text.
The model card states: "The model is not intended to inform decisions about matters central to human life, and should not be used in such a way."
([source](https://huggingface.co/beomi/kollama-7b#ethical-considerations))

#### Pipeline(task = "text-generation")
- Model Load
	- Model.from_pretrained()
- Tokenize input
	- Tokenizer()
- Generate Token Ids
	- Model.generate(tokenized_input)
- Decode
	- Tokenizer.batch_decode(generated_ids)
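
The four steps above can be sketched as two small functions. This is only an approximation of what the pipeline does, since the real pipeline handles more options. The function names are hypothetical. The `transformers` imports are deferred into the loader, so `generate_text` works with any tokenizer and model objects that expose the same methods.

```python
def load_model_and_tokenizer(model_name, **model_kwargs):
    """Step 1: load the model and tokenizer (e.g. pass device_map="auto")."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
    return model, tokenizer


def generate_text(prompts, model, tokenizer, **generate_kwargs):
    """Steps 2-4: tokenize, generate token ids, decode back to strings."""
    # Step 2: tokenize the prompts into a padded batch
    batch = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        return_token_type_ids=False,
    )
    # Move the inputs next to the model when both objects support it
    if hasattr(batch, "to") and hasattr(model, "device"):
        batch = batch.to(model.device)
    # Step 3: generate token ids
    generated_ids = model.generate(**batch, **generate_kwargs)
    # Step 4: decode the ids into text
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```

A call could look like `model, tokenizer = load_model_and_tokenizer("beomi/kollama-7b", device_map="auto")` followed by `generate_text(["구글에 파이썬 검색"], model, tokenizer, max_new_tokens=32)`.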