# KoLLaMa
Large Language Model Meta AI

[HuggingFace beomi/kollama-7b](https://huggingface.co/beomi/kollama-7b)
Uses AutoModelForCausalLM

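- CausalLM (causal language model): a language model expressed as conditional probabilities, where each token is predicted from the tokens before it
	- Easy to implement with models such as artificial neural networks
	- Can generate words one at a time, in order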
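
To make the conditional-probability view concrete, here is a minimal sketch of the chain rule behind a causal language model. It assumes a toy table of made-up probabilities that conditions only on the previous word; a real model such as KoLLaMa conditions on the whole preceding context, and the names `COND_PROB`, `sequence_log_prob`, and `greedy_generate` are illustrative only.

```python
import math

# Toy conditional probabilities P(next | previous); "<s>" marks the start.
# The numbers are invented for illustration.
COND_PROB = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a": {"cat": 0.2, "dog": 0.8},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
}


def sequence_log_prob(words):
    """Sum log P(w_t | w_{t-1}) over the sequence (chain rule, one-word context)."""
    log_prob = 0.0
    prev = "<s>"
    for word in words:
        log_prob += math.log(COND_PROB[prev][word])
        prev = word
    return log_prob


def greedy_generate(max_len=3):
    """Generate left to right, always picking the most probable next word."""
    words, prev = [], "<s>"
    for _ in range(max_len):
        if prev not in COND_PROB:
            break
        prev = max(COND_PROB[prev], key=COND_PROB[prev].get)
        words.append(prev)
    return words


print(math.exp(sequence_log_prob(["the", "cat", "sleeps"])))
print(greedy_generate())
```

The probability of the whole sentence is the product of the per-step conditionals, which is why generation can proceed one word at a time.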

In [preprocess.ipynb](/datas/KoLLaMa/Dataset.csv), the Dataset.csv used for KcELECTRA above is modified to produce a new Dataset.csv made up of sentences paired with versions whose word order has been shuffled.
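
The preprocessing notebook itself is not shown here, so the following is only a sketch of how such a pair could be built. It assumes words are split on whitespace; `shuffle_words` is a hypothetical helper, the example sentence is made up, and the column names follow the `c.df.head()` output further down.

```python
import random


def shuffle_words(sentence, rng):
    """Return the sentence with its whitespace-separated words in random order."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


rng = random.Random(42)  # fixed seed so the example is repeatable
original = "오늘 저녁은 친구와 비빔밥을 먹었다"

# One row of the new dataset: the shuffled sentence and its source sentence.
row = {"mixed": shuffle_words(original, rng), "not_mixed": original}
print(row)
```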

Uses a BBPE tokenizer
- Byte-level Byte Pair Encoding
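
As a rough illustration of the byte-level idea, the sketch below starts from raw UTF-8 bytes (a Hangul syllable takes three bytes) and applies a single BPE merge of the most frequent adjacent pair. It is a toy version under simplifying assumptions, not the tokenizer shipped with the model; `most_frequent_pair` and `merge_pair` are hypothetical helpers.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


text = "안녕안녕"
# Byte-level start: every UTF-8 byte is its own token.
tokens = [bytes([b]) for b in text.encode("utf-8")]
print(len(tokens), tokens)

pair = most_frequent_pair(tokens)
merged = merge_pair(tokens, pair)
print(len(merged), merged)
```

Because the base vocabulary is the 256 possible byte values, any string can be tokenized without an unknown-token fallback; repeated merges then build longer units.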

Fine-tune KoLLaMa-7b, then run it

## Environment

In [1]:
from src.Common.common import *

# Dataset.csv
import pandas as pd

# HuggingFace
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer

# PyTorch
from torch.utils.data import Dataset

In [2]:
c = CommonObj()

c.KoLLaMa_MODEL_NAME = "beomi/kollama-7b"
c.SAVED_MODEL_NAME = "../../models/KcELECTRA"

In [3]:
c.df = pd.read_csv("../../datas/KoLLaMa/Dataset.csv", sep = "\t")
c.df.head()

Unnamed: 0.1,Unnamed: 0,mixed,not_mixed
0,0,이종석 한효주 나오는 드라마 이후로 드라마 안봤다. 2년전인가?? 좀 신선했었지. ...,이종석 한효주 나오는 드라마 이후로 드라마 안봤다. 2년전인가?? 좀 신선했었지. ...
1,1,꽂등심이다ㅠㅜ 저녁은 오늘 술프노... 씨바알..노무노무,씨바알..노무노무 술프노... 오늘 저녁은 꽂등심이다ㅠㅜ
2,2,꺼라ㅡ패쓰 짱깨,짱깨 꺼라ㅡ패쓰
3,3,그들의 사생활 지금 ~ 설리를 위해서라도 모두 조용하길 고인이된 누굴 탓한다고 무슨...,그들의 사생활 ~ 고인이된 설리를 위해서라도 모두 조용하길 지금 누굴 탓한다고 무슨...
4,4,아무리 법이 공공의 무슨 자격으로 개인의 신상정보를 불특정 다수에게 안되네요 도저히...,아무리 법이 뭣같아도 무슨 자격으로 개인의 신상정보를 불특정 다수에게 공개하는지 도...


## Train set / Test set

In [4]:
c.train_data = c.df.sample(frac = 0.8, random_state = 42)
c.test_data = c.df.drop(c.train_data.index)

print(f"Train data length {len(c.train_data)}")
print(f"Test data length {len(c.test_data)}")

Train data length 8000
Test data length 2000


## Tokenizer

In [5]:
c.tokenizer = AutoTokenizer.from_pretrained(c.KoLLaMa_MODEL_NAME)
# c.tokenizer = AutoTokenizer.from_pretrained(c.SAVED_MODEL_NAME)

c.tokenizer

PreTrainedTokenizerFast(name_or_path='beomi/kollama-7b', vocab_size=52000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|sep|>', 'eos_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True)

#### Train data Tokenized

In [6]:
c.tokenized_train_X = c.tokenizer(
    list(c.train_data["mixed"]),
    return_tensors = "pt",
	max_length = 128,
	padding = True,
	truncation = True,
	add_special_tokens = True
)

c.tokenized_train_y = c.tokenizer(
    list(c.train_data["not_mixed"]),
    return_tensors = "pt",
	max_length = 128,
	padding = True,
	truncation = True,
	add_special_tokens = True
)

#### Test data Tokenized

In [7]:
c.tokenized_test_X = c.tokenizer(
    list(c.test_data["mixed"]),
    return_tensors = "pt",
	max_length = 128,
	padding = True,
	truncation = True,
	add_special_tokens = True
)

c.tokenized_test_y = c.tokenizer(
    list(c.test_data["not_mixed"]),
    return_tensors = "pt",
	max_length = 128,
	padding = True,
	truncation = True,
	add_special_tokens = True
)
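
The tokenizer calls above combine `truncation = True`, `max_length = 128`, and `padding = True`. A simplified sketch of what that combination does to a batch, using made-up token ids and a hypothetical `pad_and_truncate` helper (not the tokenizer's internals), might look like this:

```python
def pad_and_truncate(batch, max_length, pad_id):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


batch = [[11, 12, 13], [21], [31, 32, 33, 34, 35, 36]]
ids, mask = pad_and_truncate(batch, max_length=4, pad_id=0)
print(ids)
print(mask)
```

Padding goes on the right here to match `padding_side='right'` in the tokenizer summary printed earlier.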

In [8]:
print(c.tokenized_test_X[0])
print(c.tokenized_test_X[0].tokens)
print(c.tokenized_test_y[0])
print(c.tokenized_test_y[0].tokens)

Encoding(num_tokens=128, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
['ê½Ĥ', 'ëĵ±', 'ìĭ¬', 'ìĿ´ëĭ¤', 'áħ²', 'áħ®', 'ĠìłĢëħģ', 'ìĿĢ', 'Ġìĺ¤ëĬĺ', 'ĠìĪł', 'íĶĦ', 'ëħ¸', '...', 'ĠìĶ¨', 'ë°Ķ', 'ìķĮ', '..', 'ëħ¸', 'ë¬´ë', 'ħ¸', 'ë¬´', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|end

In [9]:
print(c.tokenizer.convert_tokens_to_string(c.tokenized_test_X[0].tokens))

꽂등심이다ᅲᅮ 저녁은 오늘 술프노... 씨바알..노무노무<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endof

## Custom Dataset

#### 1. Dataset
Implement a class that inherits from torch.utils.data.Dataset to define how the data is loaded.


#### 2. DataLoader
Use the torch.utils.data.DataLoader class directly, wrapping a Dataset instance:
`train_dataloader = DataLoader(train_dataset, batch_size = 4, shuffle = True)`
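
To show what the two pieces contribute, here is a plain-Python stand-in: a map-style dataset only needs `__len__` and `__getitem__`, and a loader shuffles indices and groups items into batches. This is a sketch of the idea, not PyTorch's implementation (the real `DataLoader` also collates items into tensors and can load in parallel); `PairDataset` and `iterate_batches` are hypothetical names.

```python
import random


class PairDataset:
    """Map-style dataset: supports len() and integer indexing."""

    def __init__(self, xs, ys):
        assert len(xs) == len(ys)
        self.pairs = list(zip(xs, ys))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]


def iterate_batches(dataset, batch_size, shuffle, seed=0):
    """Yield (inputs, targets) batches, optionally in shuffled order."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = [dataset[i] for i in indices[start:start + batch_size]]
        xs, ys = zip(*chunk)
        yield list(xs), list(ys)


dataset = PairDataset(list(range(10)), [x * x for x in range(10)])
batches = list(iterate_batches(dataset, batch_size=4, shuffle=True))
for xs, ys in batches:
    print(xs, ys)
```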

In [10]:
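class CustomDataset(Dataset):
	def __init__(self, data_X, data_y):
		# Trainer's default collator expects each item to be a dict whose keys
		# match the model's forward() arguments, not an (X, y) tuple.
		self.input_ids = data_X["input_ids"]
		self.attention_mask = data_X["attention_mask"]
		# Assumes data_y is padded to the same sequence length as data_X.
		self.labels = data_y["input_ids"]

	def __len__(self):
		return len(self.input_ids)

	def __getitem__(self, idx):
		return {
			"input_ids": self.input_ids[idx].clone().detach(),
			"attention_mask": self.attention_mask[idx].clone().detach(),
			"labels": self.labels[idx].clone().detach(),
		}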

c.train_dataset = CustomDataset(c.tokenized_train_X, c.tokenized_train_y)
c.test_dataset = CustomDataset(c.tokenized_test_X, c.tokenized_test_y)

## Training Arguments

In [11]:
c.training_args = TrainingArguments(
	output_dir = "./",
	num_train_epochs = 10,
	per_device_train_batch_size = 8,
	per_device_eval_batch_size = 64,
	logging_dir = "./logs",
	logging_steps = 500, # log training progress every 500 steps
	log_level = "warning", # default
	save_total_limit = 2 # maximum number of checkpoints to keep
)
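
For a sense of scale, the arguments above imply the following step counts. This is back-of-the-envelope arithmetic that assumes a single device and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

train_examples = 8000   # "Train data length" printed earlier
batch_size = 8          # per_device_train_batch_size
epochs = 10             # num_train_epochs
logging_steps = 500     # logging_steps

# Assumption: one device and gradient_accumulation_steps = 1.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_events)
```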

## Model Load

In [12]:
c.model = AutoModelForCausalLM.from_pretrained(c.KoLLaMa_MODEL_NAME).to("cuda:0")
# c.model = AutoModelForCausalLM.from_pretrained(c.SAVED_MODEL_NAME).to("cuda:0")

c.model

Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 12.00 GiB total capacity; 11.21 GiB already allocated; 0 bytes free; 11.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

# Ran out of GPU memory, damn it
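
A rough estimate shows why the load fails on a 12 GiB card. The sketch below assumes a nominal 7 billion parameters (taken from the "7b" in the model name, not measured) and counts only the weights; fine-tuning would need additional memory for gradients, optimizer state, and activations.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(n_params, dtype):
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


n_params = 7_000_000_000  # assumption: nominal size implied by "7b"
gpu_gib = 12.0            # total capacity reported in the error message

for dtype in BYTES_PER_PARAM:
    need = weight_memory_gib(n_params, dtype)
    verdict = "fits" if need < gpu_gib else "does not fit"
    print(f"{dtype}: {need:.2f} GiB -> {verdict} in {gpu_gib:.0f} GiB")
```

Under these assumptions, neither 32-bit nor 16-bit weights fit on the GPU, which is consistent with the out-of-memory error above.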

In [13]:
print(c.model.hf_device_map)

AttributeError: 'CommonObj' object has no attribute 'model'

## Trainer

In [82]:
c.trainer = Trainer(
	model = c.model,
	args = c.training_args,
	train_dataset = c.train_dataset,
	eval_dataset = c.test_dataset,
	compute_metrics = compute_metrics
)

NotImplementedError: Cannot copy out of meta tensor; no data!

In [14]:
del c