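# 🚩 Mission Brief

## 🏕️ Story: "Camping Master AI"
**Camping Master**, the country's largest outdoor company, plans to launch an **integrated AI assistant** that offers a gear-recommendation chatbot, campsite review analysis, and pre-trip safety check alerts. As a newly hired **NLP engineer**, you will build and test the following four modules in order.

The **starter code** for each stage is more than 50% complete, so fill in only the parts marked `### TODO`, then run and verify them.

> **Environment setup** (required)  
> ```bash
> pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
> pip install datasets transformers accelerate seqeval
> ```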


In [1]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install datasets transformers accelerate seqeval


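### Task 1 (Intent Classification with a GRU)  
**Goal**: Classify users' natural-language requests into 150 intents to evaluate chatbot routing performance. The `clinc_oos` label set also includes an out-of-scope class, for 151 labels in total.  
**Data**: Hugging Face → `clinc_oos`, config `"small"`  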

```python
from datasets import load_dataset
raw = load_dataset("clinc_oos", "small")
print(raw["train"][0])
```
#### Starter Code


In [2]:
!pip install --upgrade datasets huggingface_hub
from datasets import load_dataset
raw = load_dataset("clinc_oos", "small", split="train")
print(raw[0])



{'text': 'can you walk me through setting up direct deposits to my bank of internet savings account', 'intent': 108}


In [5]:
from datasets import load_dataset
from transformers import AutoTokenizer
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

MAX_LEN, BATCH = 32, 64
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
pad_id = tok.pad_token_id

def tokenize(b):
    return tok(b["text"], padding="max_length",
               truncation=True, max_length=MAX_LEN)

ds = load_dataset("clinc_oos", "small")

# Rename the column intent → labels
ds = ds.rename_column("intent", "labels")

# Tokenize
ds = ds.map(tokenize, batched=True)

# Return the listed columns as Torch tensors
ds.set_format(type="torch",
              columns=["input_ids", "attention_mask", "labels"])

train_dl = DataLoader(ds["train"], batch_size=BATCH, shuffle=True)
test_dl  = DataLoader(ds["test"],  batch_size=BATCH)


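class GRUClassifier(nn.Module):
    def __init__(self, vocab, embed_dim, hidden, num_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim, padding_idx=pad_id)
        self.gru   = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.fc    = nn.Linear(hidden*2, num_labels)

    def forward(self, ids, mask):
        x = self.embed(ids)
        # Pack so the GRU stops at each sequence's true length instead of running over padding
        lengths = mask.sum(dim=1).cpu()
        packed = nn.utils.rnn.pack_padded_sequence(
            x, lengths, batch_first=True, enforce_sorted=False)
        _, h = self.gru(packed)                 # h: [2, batch, hidden]
        # Concatenate the final forward state h[0] and the final backward state h[1];
        # slicing out[:,0,:hidden] and out[:,-1,hidden:] would pick states that have each seen only one token
        x = torch.cat([h[0], h[1]], dim=-1)
        return self.fc(x)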

### TODO 1-a : set hyperparameters such as the embedding dimension and hidden size
vocab_size = tok.vocab_size
embed_dim = 128
hidden = 64
num_labels = ds["train"].features["labels"].num_classes

model = GRUClassifier(vocab_size, embed_dim, hidden, num_labels)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
EPOCHS = 3

### TODO 1-b : write the training loop (define optimizer and loss, iterate over epochs)
model.train()
for epoch in range(EPOCHS):
    total_loss = 0
    for batch in train_dl:
        ids = batch['input_ids']
        mask = batch['attention_mask']
        labels = batch['labels']
        optimizer.zero_grad()
        logits = model(ids, mask)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}/{EPOCHS}, Loss: {total_loss/len(train_dl):.4f}")

### TODO 1-c : print test-set accuracy
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch in test_dl:
        ids = batch['input_ids']
        mask = batch['attention_mask']
        labels = batch['labels']
        logits = model(ids, mask)
        preds = torch.argmax(logits, dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Test Accuracy: {correct/total*100:.2f}%")


Epoch 1/3, Loss: 5.0270
Epoch 2/3, Loss: 5.0205
Epoch 3/3, Loss: 5.0189
Test Accuracy: 18.18%
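
The logged loss and accuracy can be compared against two trivial baselines. The sketch below assumes the label set has 151 classes (150 intents plus `oos`) and that the `small` test split holds 1,000 out-of-scope utterances among its 5,500 examples, as described for CLINC150; both figures are assumptions worth verifying against `ds["test"]`.

```python
import math

NUM_CLASSES = 151          # assumed: 150 intents + 1 out-of-scope label
TEST_SIZE = 5500           # size of the clinc_oos "small" test split
OOS_IN_TEST = 1000         # assumed: out-of-scope examples in the test split

# Cross-entropy of a model that assigns equal probability to every class
chance_loss = math.log(NUM_CLASSES)

# Accuracy of a model that always predicts the single most frequent test label
majority_acc = OOS_IN_TEST / TEST_SIZE * 100

print(f"chance-level loss: {chance_loss:.4f}")
print(f"majority-class accuracy: {majority_acc:.2f}%")
```

Both values sit right at the logged numbers (loss about 5.02, accuracy 18.18%), which suggests the recorded run never moved past predicting a single class. A pooling step that reads the forward direction at the first time step and the backward direction at the last one would explain this, since each of those states has seen only one token.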


---

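### Task 2 (BiLSTM + CRF Named Entity Recognition)  
**Goal**: Extract entities such as **gear, location, and date** from a camping packing checklist.  
**Data**: `conll2003` (English). The domain differs from real camping text, but the dataset is used here to learn the model architecture.  

> If implementing the CRF layer is too much, completing just the **BiLSTM+Linear** architecture with the `seqeval` metric is enough for a passing score.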

```python
# Data example
from datasets import load_dataset
ner = load_dataset("conll2003")
print(ner["train"][0])
```
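
For readers who want to see what the optional CRF layer mentioned in the note above adds over a plain linear tagger: a CRF scores tag-to-tag transitions in addition to per-token emissions, and decoding picks the best whole path with Viterbi. The following is a minimal pure-Python sketch of that decoding step only; the function name `viterbi` and all scores are hypothetical, and it is not the implementation of any particular CRF library.

```python
from typing import Dict, List

def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[tuple, float],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag path given per-token emission scores
    and tag-to-tag transition scores (missing transitions score 0)."""
    # best[t][tag] = best score of any path ending in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t-1][tag] = previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final tag
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical scores for the two tokens "New York"
tags = ["O", "B-LOC", "I-LOC"]
emissions = [
    {"O": 0.1, "B-LOC": 2.0, "I-LOC": 0.5},   # "New"
    {"O": 1.2, "B-LOC": 0.3, "I-LOC": 1.0},   # "York"
]
transitions = {("B-LOC", "I-LOC"): 1.5, ("O", "I-LOC"): -5.0}

print(viterbi(emissions, transitions, tags))  # ['B-LOC', 'I-LOC']
```

With these made-up scores, a per-token argmax would tag "York" as `O`; the transition bonus for `B-LOC → I-LOC` is what lets path-level decoding keep the entity intact.
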
#### Starter Code (BiLSTM Tagger skeleton)


In [6]:
class BiLSTMTagger(nn.Module):
    def __init__(self, vocab, tagset, embed_dim=100, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        self.lstm  = nn.LSTM(embed_dim, hidden, num_layers=1,
                             batch_first=True, bidirectional=True)
        self.fc    = nn.Linear(hidden*2, tagset)

    def forward(self, ids, mask):
        em = self.embed(ids)
        out, _ = self.lstm(em)
        return self.fc(out)          # [batch, seq, tag]

### TODO 2-a : build the vocabulary from the training set
ner = load_dataset("conll2003")
train_ner = ner["train"]
word2idx = {"<PAD>":0, "<UNK>":1}
tag_names = train_ner.features['ner_tags'].feature.names
tag2idx = {t:i for i,t in enumerate(tag_names)}
pad_tag = -100
for sample in train_ner:
    for w in sample['tokens']:
        if w not in word2idx:
            word2idx[w] = len(word2idx)
### TODO 2-b : implement the padding & masking function
def pad_and_mask(batch):
    seqs, labs = zip(*batch)
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [0]*(max_len-len(s)) for s in seqs]
    label_ids = [l + [pad_tag]*(max_len-len(l)) for l in labs]
    mask = [[1]*len(s) + [0]*(max_len-len(s)) for s in seqs]
    return torch.tensor(input_ids), torch.tensor(mask), torch.tensor(label_ids)

train_dl2 = DataLoader([( [word2idx.get(w,1) for w in sample['tokens']], sample['ner_tags']) for sample in train_ner],
                        batch_size=32, shuffle=True, collate_fn=pad_and_mask)

### TODO 2-c : train the model with CrossEntropyLoss(ignore_index=pad_tag)

model2 = BiLSTMTagger(len(word2idx), len(tag2idx))
optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
criterion2 = nn.CrossEntropyLoss(ignore_index=pad_tag)
for epoch in range(3):
    model2.train()
    total_loss = 0
    for ids, mask, labs in train_dl2:
        optimizer2.zero_grad()
        logits = model2(ids, mask)
        loss = criterion2(logits.view(-1, logits.size(-1)), labs.view(-1))
        loss.backward()
        optimizer2.step()
        total_loss += loss.item()
    print(f"NER Epoch {epoch+1}, Loss: {total_loss/len(train_dl2):.4f}")
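### TODO 2-d : measure performance with seqeval (f1)
from seqeval.metrics import classification_report

# Score on the held-out test split; scoring train_dl2 would only measure fit to the training data
test_dl2 = DataLoader([([word2idx.get(w,1) for w in sample['tokens']], sample['ner_tags']) for sample in ner["test"]],
                      batch_size=32, collate_fn=pad_and_mask)

model2.eval()
true_preds, true_labels = [], []
with torch.no_grad():
    for ids, mask, labs in test_dl2:
        logits = model2(ids, mask)
        preds = torch.argmax(logits, dim=-1).tolist()
        labs = labs.tolist()
        for p, l, m in zip(preds, labs, mask.tolist()):
            # Keep only real (non-padded) positions
            true_preds.append([tag_names[i] for i,mi in zip(p,m) if mi])
            true_labels.append([tag_names[i] for i,mi in zip(l,m) if mi])
print(classification_report(true_labels, true_preds))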




NER Epoch 1, Loss: 0.5261
NER Epoch 2, Loss: 0.2367
NER Epoch 3, Loss: 0.1273
              precision    recall  f1-score   support

         LOC       0.95      0.89      0.92      7140
        MISC       0.87      0.78      0.82      3438
         ORG       0.86      0.80      0.83      6321
         PER       0.92      0.90      0.91      6600

   micro avg       0.91      0.85      0.88     23499
   macro avg       0.90      0.84      0.87     23499
weighted avg       0.90      0.85      0.88     23499
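
The logged report has a support of 23,499 entities, which is the size of the CoNLL-2003 training split, so those scores describe fit to the training data; a held-out split is likely to score lower. It also helps to know that `seqeval` scores whole entities rather than individual tokens. The following pure-Python sketch illustrates that idea under simplifying assumptions (strict BIO input, stray `I-` tags ignored); the helper names `extract_entities` and `entity_f1` are hypothetical and this is not seqeval's own code.

```python
from typing import List, Set, Tuple

def extract_entities(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (type, start, end) spans from one BIO-tagged sentence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            entities.add((etype, start, i - 1))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 where an entity counts only if type and span both match."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_ents, p_ents = extract_entities(g), extract_entities(p)
        tp += len(g_ents & p_ents)
        n_gold += len(g_ents)
        n_pred += len(p_ents)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O",     "O", "B-LOC"]]   # PER span is cut short, LOC is exact
print(f"{entity_f1(gold, pred):.2f}")        # 0.50
```

Three of the four tokens are tagged correctly here, yet entity-level F1 is 0.50 because the truncated `PER` span does not count as a match.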



---

### Task 3 (1D CNN Review Sentiment Classification)  
**Goal**: Classify campsite reviews as positive or negative, as a proxy for high versus low star ratings.  
**Data**: Hugging Face → `imdb` (binary sentiment)

#### Starter Code


In [9]:
class TextCNN(nn.Module):
    def __init__(self, vocab, embed_dim=128, num_classes=2, k=[3,4,5]):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim, padding_idx=pad_id)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, 100, kernel_size=ks) for ks in k])
        self.fc = nn.Linear(100*len(k), num_classes)

    def forward(self, ids):
        x = self.embed(ids).transpose(1,2)                # [B,C,L]
        x = torch.cat([torch.relu(c(x)).max(2)[0] for c in self.convs], 1)
        return self.fc(x)

### TODO 3-a : write the tokenization step and DataLoaders
imdb = load_dataset("imdb")
train_imdb = imdb['train']
test_imdb = imdb['test']
def tokenize3(text):
    return tok(text, padding='max_length', truncation=True, max_length=MAX_LEN)
train_dl3 = DataLoader([(torch.tensor(tokenize3(x)['input_ids']), torch.tensor(y)) for x,y in zip(train_imdb['text'], train_imdb['label'])], batch_size=BATCH, shuffle=True)
test_dl3  = DataLoader([(torch.tensor(tokenize3(x)['input_ids']), torch.tensor(y)) for x,y in zip(test_imdb['text'], test_imdb['label'])], batch_size=BATCH)

### TODO 3-b : training loop and accuracy evaluation
model3 = TextCNN(tok.vocab_size)
optimizer3 = torch.optim.Adam(model3.parameters(), lr=1e-3)
criterion3 = nn.CrossEntropyLoss()
for epoch in range(3):
    model3.train()
    total_loss = 0
    for ids, labels in train_dl3:
        optimizer3.zero_grad()
        logits = model3(ids)
        loss = criterion3(logits, labels)
        loss.backward()
        optimizer3.step()
        total_loss += loss.item()
    print(f"CNN Epoch {epoch+1}, Loss: {total_loss/len(train_dl3):.4f}")

model3.eval()
correct3, total3 = 0, 0
with torch.no_grad():
    for ids, labels in test_dl3:
        preds = torch.argmax(model3(ids), dim=1)
        correct3 += (preds == labels).sum().item()
        total3 += labels.size(0)
print(f"CNN Test Accuracy: {correct3/total3*100:.2f}%")


CNN Epoch 1, Loss: 0.6282
CNN Epoch 2, Loss: 0.4573
CNN Epoch 3, Loss: 0.2683
CNN Test Accuracy: 68.63%
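
The tensor shapes inside `TextCNN.forward` can be traced by hand. This short sketch uses the standard 1D convolution length formula with the values from the cell above (`MAX_LEN = 32`, kernel sizes 3/4/5, 100 filters each); the helper `conv1d_out_len` is a hypothetical name introduced only for illustration.

```python
MAX_LEN = 32              # sequence length used by the tokenizer call
KERNEL_SIZES = [3, 4, 5]  # default `k` in TextCNN
NUM_FILTERS = 100         # out_channels of each Conv1d

def conv1d_out_len(seq_len: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1D convolution (dilation 1)."""
    return (seq_len + 2 * padding - kernel) // stride + 1

for ks in KERNEL_SIZES:
    print(f"kernel={ks}: conv output [B, {NUM_FILTERS}, {conv1d_out_len(MAX_LEN, ks)}]")

# Max over the time axis keeps one value per filter, so the lengths drop out
fc_in_features = NUM_FILTERS * len(KERNEL_SIZES)
print("features into fc:", fc_in_features)
```

Because the pooled feature count does not depend on sequence length, `MAX_LEN` can be raised without touching the model. With `MAX_LEN = 32` each IMDB review is truncated to 32 token ids, which may be one reason the logged test accuracy stays at 68.63%.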


## Dataset Download Summary

| Task | Dataset (🤗 Datasets) | Example call |
|------|-----------------------|-----------|
| 1 | `clinc_oos`, `"small"` | `load_dataset("clinc_oos","small")` |
| 2 | `conll2003` | `load_dataset("conll2003")` |
| 3 | `imdb` | `load_dataset("imdb")` |
| 4 | `daily_dialog` | `load_dataset("daily_dialog")` |

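### Closing Tips
- Saving a checkpoint in each task script with `torch.save(model.state_dict(), "...")` saves time in follow-up experiments (hyperparameter tuning, ensembling).  
- To iterate quickly on CPU, shrink the dataset with `.select(range(N))` from `datasets` while debugging, then retrain on the full data in a GPU environment.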
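
The checkpoint tip can be sketched as a save-and-restore round trip. The example below uses a small stand-in `nn.Linear` model and a hypothetical file name in a temporary directory; the same two calls apply to the task models, provided the restored object is built with the same architecture.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model; any nn.Module from the tasks above is saved the same way
model = nn.Linear(4, 2)

ckpt_path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint.pt")  # hypothetical file name
torch.save(model.state_dict(), ckpt_path)

# Rebuild the same architecture, then load the saved weights into it
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(ckpt_path))
restored.eval()

x = torch.randn(3, 4)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print("outputs match:", same)
```

Saving only the `state_dict` keeps the file independent of the class definition's location, at the cost of having to reconstruct the model before loading.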

Good luck with the launch of **Camping Master AI**! 🏕️