### 1. Data loading
- Take only the first 50 samples of the IMDB dataset and split them 8:2 into training and evaluation sets.

In [24]:
from datasets import load_dataset
import time
from tqdm.auto import tqdm

print("Starting dataset download...")
start_time = time.time()

# Load the dataset while showing a progress bar
print("📥 Downloading the IMDB dataset...")
print("   (The first run downloads from the internet, so it may take a while)")

# This bar is cosmetic: load_dataset() does not report progress to it,
# so it is simply set to 100% once loading returns.
with tqdm(total=100, desc="Download progress", unit="%",
          bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]') as pbar:

    # Load the first 50 training rows (in stored order, no shuffling), then split 80/20
    dataset = load_dataset("imdb", split="train[:50]").train_test_split(test_size=0.2)

    # Mark the progress bar as complete
    pbar.n = 100
    pbar.refresh()

end_time = time.time()
print(f"✅ Dataset loaded! Elapsed time: {end_time - start_time:.2f}s")
print(f"📊 Training samples: {len(dataset['train'])}, test samples: {len(dataset['test'])}")

Starting dataset download...
📥 Downloading the IMDB dataset...
   (The first run downloads from the internet, so it may take a while)


Download progress: 100%|██████████| 100/100 [00:04<00:00]

✅ Dataset loaded! Elapsed time: 4.75s
📊 Training samples: 40, test samples: 10
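
One thing worth checking with a slice like `train[:50]` is that it takes rows in stored order. The IMDB train split on the Hugging Face Hub appears to be grouped by label (negative reviews first); if so, the first 50 rows would all carry label 0 and the classifier would never see a positive example. The sketch below uses only the standard library and a made-up label list (`labels`, 12,500 zeros followed by 12,500 ones, an assumption standing in for the real split) to contrast slicing with sampling after a shuffle.

```python
import random
from collections import Counter

# Hypothetical stand-in for a label-sorted split: an assumption, not the real data.
labels = [0] * 12_500 + [1] * 12_500

# Equivalent of split="train[:50]": the first 50 rows in stored order.
first_50 = labels[:50]

# Shuffle a copy with a fixed seed, then take 50 rows.
rng = random.Random(42)
shuffled = labels[:]
rng.shuffle(shuffled)
shuffled_50 = shuffled[:50]

print("first 50   :", Counter(first_50))
print("shuffled 50:", Counter(shuffled_50))
```

With the `datasets` library, the analogous call would be something like `load_dataset("imdb", split="train").shuffle(seed=42).select(range(50))`. Running `Counter(dataset["train"]["label"])` on the real data is a quick way to confirm whether the concern applies here.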





In [25]:
sample = dataset["train"][5]
print(f"Review text: {sample['text']}")
print(f"Label (0: negative, 1: positive): {sample['label']}")

# len() of a DatasetDict is the number of splits (train, test), not the number of rows
print(len(dataset))

Review text: When I first saw a glimpse of this movie, I quickly noticed the actress who was playing the role of Lucille Ball. Rachel York's portrayal of Lucy is absolutely awful. Lucille Ball was an astounding comedian with incredible talent. To think about a legend like Lucille Ball being portrayed the way she was in the movie is horrendous. I cannot believe out of all the actresses in the world who could play a much better Lucy, the producers decided to get Rachel York. She might be a good actress in other roles but to play the role of Lucille Ball is tough. It is pretty hard to find someone who could resemble Lucille Ball, but they could at least find someone a bit similar in looks and talent. If you noticed York's portrayal of Lucy in episodes of I Love Lucy like the chocolate factory or vitavetavegamin, nothing is similar in any way-her expression, voice, or movement.<br /><br />To top it all off, Danny Pino playing Desi Arnaz is horrible. Pino does not qualify to play as Ricky. He's 

In [26]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Loads DistilBERT with a newly initialized classification head (2 labels by default)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Data preprocessing

In [27]:
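def tokenize(batch):
    # Pad to the longest sequence in the batch and truncate to the model's maximum length
    return tokenizer(batch["text"], padding=True, truncation=True)

# Adds input_ids and attention_mask columns to both splits
dataset = dataset.map(tokenize, batched=True)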

Map: 100%|██████████| 40/40 [00:00<00:00, 4304.39 examples/s]
Map: 100%|██████████| 10/10 [00:00<00:00, 1572.55 examples/s]
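
To make `padding=True` and `truncation=True` concrete, here is a minimal pure-Python sketch of the idea: each sequence in a batch is cut to a maximum length, then right-padded to the longest remaining sequence, with an attention mask marking the real tokens. The function `pad_and_truncate`, the `pad_id=0` default, and the toy token IDs are illustrative assumptions, not the tokenizer's actual implementation.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one.

    Simplification: real tokenizers keep special tokens such as [SEP]
    when truncating; this sketch just cuts the tail.
    """
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token IDs (made up for illustration).
toy_batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 9, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

In the cell above, `padding=True` pads to the longest sequence within each `map` batch. Because each split here fits in a single batch, all rows within a split share one length, though the train and test splits may be padded to different lengths.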


#### Training configuration and Trainer setup

In [28]:
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score

# Accuracy metric; called only when evaluation runs (e.g. trainer.evaluate())
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    acc = accuracy_score(labels, predictions)
    return {"accuracy": acc}

args = TrainingArguments(
    output_dir="test",
    per_device_train_batch_size=8,
    num_train_epochs=15,
    report_to="none",   # disable external logging integrations
    logging_steps=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,    # attach the accuracy metric
)
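
The metric computed by `compute_metrics` is an argmax over each row of logits followed by the fraction of predictions that match the labels. The standard-library sketch below mirrors that calculation on made-up logits and labels; `argmax`, `accuracy`, `toy_logits`, and `toy_labels` are illustrative names and values, not results from this notebook.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose argmax equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up two-class logits and labels for illustration.
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]]
toy_labels = [0, 1, 0, 0]
print({"accuracy": accuracy(toy_logits, toy_labels)})
```

Since no evaluation schedule is set in `TrainingArguments`, the metric would only be reported by an explicit call such as `trainer.evaluate()` after training.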

#### Run fine-tuning

In [29]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 1 | 0.657 |
| 2 | 0.4907 |
| 3 | 0.3389 |
| 4 | 0.2288 |
| 5 | 0.1419 |
| 6 | 0.1057 |
| 7 | 0.0689 |
| 8 | 0.044 |
| 9 | 0.0313 |
| 10 | 0.0237 |


TrainOutput(global_step=75, training_loss=0.030708623349200933, metrics={'train_runtime': 30.287, 'train_samples_per_second': 19.81, 'train_steps_per_second': 2.476, 'total_flos': 79480439193600.0, 'train_loss': 0.030708623349200933, 'epoch': 15.0})
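
The `global_step=75` in this output follows from the settings above: 40 training samples at a batch size of 8 give 5 optimizer steps per epoch, and 15 epochs give 75 steps. Here is a small sketch of that arithmetic, assuming a single device and no gradient accumulation (neither is configured in `TrainingArguments` above).

```python
import math

train_samples = 40   # size of the training split
batch_size = 8       # per_device_train_batch_size
epochs = 15          # num_train_epochs

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)

# Cross-check against the reported throughput: steps divided by runtime.
train_runtime = 30.287  # seconds, from the TrainOutput above
print(round(total_steps / train_runtime, 3))
```

The second figure should agree with the reported `train_steps_per_second`, which is a quick sanity check that the run covered every batch.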

- First run result
```text
metrics={'train_runtime': 34.6252, 'train_samples_per_second': 17.328, 'train_steps_per_second': 2.166, 'total_flos': 79480439193600.0, 'train_loss': 0.03811097462972005, 'epoch': 15.0})
```

- Second run result
```text
TrainOutput(global_step=75, training_loss=0.030708623349200933, metrics={'train_runtime': 30.287, 'train_samples_per_second': 19.81, 'train_steps_per_second': 2.476, 'total_flos': 79480439193600.0, 'train_loss': 0.030708623349200933, 'epoch': 15.0})
```

#### Run a prediction with the trained model

In [33]:
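# text = "I would put this at the top of the list of films in the category of unwatchable trash."
text = "I can watch this all day."
# "pt": return PyTorch tensors
# Tokenize the input sentence and move it to the model's device (MPS in this run)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Note: inference is usually run after model.eval() and inside torch.no_grad();
# the output below was produced without them.
outputs = model(**inputs)
# Use the index of the highest score as the prediction
predictions = outputs.logits.argmax(dim=-1)

print(f"Predicted label: {predictions[0]}")
# This is the raw logit of the predicted class, not a probability
print(f"Predicted logit: {outputs.logits[0][predictions[0]].item()}")
print("positive" if predictions[0] == 1 else "negative")

label = outputs.logits.argmax(dim=-1).item()
print(label)
print(outputs.logits[0])
print("positive" if label == 1 else "negative")

Predicted label: 0
Predicted logit: 2.584162473678589
negative
0
tensor([ 2.5842, -2.5814], device='mps:0', grad_fn=<SelectBackward0>)
negative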
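
The values in `outputs.logits` are raw, unnormalized scores, which is why the number printed above exceeds 1. To report a probability, the logits are usually passed through a softmax. The sketch below does this with the standard library on the two logits from the output above. The helper `softmax` is defined here for illustration; in PyTorch the equivalent would be `torch.softmax(outputs.logits, dim=-1)`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.5842, -2.5814]  # values from the tensor printed above
probs = softmax(logits)
print([round(p, 4) for p in probs])
```

The first entry is the probability of the negative class and the second of the positive class. A confident negative prediction for a sentence like this one would be expected if the 50-row training slice contained only negative reviews.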
