# Korean NSMC Sentiment Classification

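- https://github.com/e9t/nsmc is a corpus of Naver movie reviews, consisting of a Train dataset and a Test dataset.
- First, clone the data from that GitHub repository.
- After cloning, use 5,000 examples from the Train dataset and 500 from the Test dataset. (This is because of training time; you may use more if your hardware allows, but no extra points are awarded.)
- Build a Korean sentiment analysis model by following the approach in pytorch sentiment analysis at https://github.com/bentrevett/pytorch-sentiment-analysis.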
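
The 5,000 / 500 subsets mentioned above can be drawn reproducibly. The following is a minimal sketch using only the standard library and placeholder rows; the helper name `subsample` and the placeholder data are assumptions for illustration, not part of the assignment. With pandas, `DataFrame.sample(n=5000, random_state=seed)` would likely serve the same purpose.

```python
import random


def subsample(rows, n, seed=42):
    """Return n rows drawn without replacement, reproducibly (hypothetical helper)."""
    if n >= len(rows):
        return list(rows)
    rng = random.Random(seed)  # a local generator keeps the global random state untouched
    return rng.sample(rows, n)


# Placeholder rows standing in for (id, document, label) records from NSMC.
train_rows = [(i, f"review {i}", i % 2) for i in range(150_000)]
test_rows = [(i, f"review {i}", i % 2) for i in range(50_000)]

train_subset = subsample(train_rows, 5_000)
test_subset = subsample(test_rows, 500)
print(len(train_subset), len(test_subset))
```

Fixing the seed means the same subset is selected on every run, which keeps later accuracy numbers comparable across runs.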



## Goals


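- Load the txt files from the GitHub repository and use torchtext to convert the data into a form that can be fed to a neural network
- Write a preprocessing function for Korean data and integrate it with torchtext
- Improve performance using the several models presented (excluding transformers)
- Apply the trained and evaluated model to the test data and report its performance
- Use predict to show the classification results for the given movie reviews

- References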
    
    - https://pytorch.org/text/
    - http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
    - https://github.com/pytorch/text
    - https://mc.ai/using-fine-tuned-gensim-word2vec-embeddings-with-torchtext-and-pytorch/
    - https://github.com/bentrevett

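## Submission


- Email: 장동준 qwer4107@snu.ac.kr
- Deadline: Monday, May 27, 2024, 11:59:59 PM

- **Notes**
    1) Submit an .ipynb file
    2) If you built it in Colab, check that the whole notebook runs without errors after resetting the runtime
    3) Present the results of the User Input section clearly
    4) When installing a new module or library, state it with # !pip

## Summary
- Summarize the performance of the implemented system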


### Data preparation
- Prepare the Naver movie review data files with git clone

In [1]:
!git clone https://github.com/e9t/nsmc

fatal: destination path 'nsmc' already exists and is not an empty directory.


In [3]:
%pip install datasets


In [1]:
%pip install konlpy
%pip install tqdm
%pip install torchtext


In [4]:
## Start of the program
import torch
import pandas as pd
import math
import collections
import tqdm
from torch.utils.data import IterableDataset
import torch.nn as nn
import torch.optim as optim
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator, GloVe
from konlpy.tag import Okt
import datasets
import numpy as np
import re
import warnings
warnings.filterwarnings(action='ignore')



In [5]:
# torch version: 2.3.0
torch.__version__

'2.3.0'

In [6]:
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)

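### Data preprocessing
DataFrame preprocessing
- Use a regular expression to keep only Hangul characters in the data (this filter is commented out in the function below, which as written only drops duplicate and missing rows)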
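
As an illustration of the Hangul-only filter described above, here is a minimal standalone sketch using the same character class that appears in the notebook. The helper name `keep_hangul` and the whitespace collapsing are assumptions added for the example.

```python
import re

# Matches anything that is not a Hangul consonant jamo, vowel jamo, syllable, or a space.
NON_HANGUL = re.compile(r"[^ㄱ-ㅎㅏ-ㅣ가-힣 ]")


def keep_hangul(text):
    """Drop non-Hangul characters, then collapse repeated spaces (the collapsing is an added assumption)."""
    cleaned = NON_HANGUL.sub("", text)
    return re.sub(r" +", " ", cleaned).strip()


print(keep_hangul("Good 영화!! ㅋㅋ 123"))  # 영화 ㅋㅋ
```

A review that contains no Hangul at all becomes an empty string, which is why such rows need to be dropped after filtering.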

In [6]:
# Preprocess the DataFrame
def preprocess_dataframe(input_dataframe: pd.DataFrame) -> pd.DataFrame:
    return_dataframe = input_dataframe.copy(deep=True)
    # Drop reviews whose text is duplicated
    return_dataframe.drop_duplicates(subset=['document'], inplace=True)
    # Keep only Hangul and spaces
    # return_dataframe['korean'] = return_dataframe['document'].str.replace(pat=r"[^ㄱ-ㅎㅏ-ㅣ가-힣 ]", repl=r"", regex=True)
    # To keep English letters as well
    # return_dataframe['document'] = return_dataframe['document'].str.replace(r"[^ㄱ-ㅎㅏ-ㅣ가-힣a-zA-Z ]", "", regex=True)
    # Replace strings that are empty after filtering with NaN
    # return_dataframe['korean'] = return_dataframe['korean'].replace('', np.nan)
    # Drop rows that contain any NaN
    return_dataframe = return_dataframe.dropna(how='any')
    return return_dataframe

- Load the DataFrames
- train:    150000
- test:     50000

In [8]:
# git clone places the files under the nsmc/ directory
train_df = pd.read_table("nsmc/ratings_train.txt")
test_df = pd.read_table("nsmc/ratings_test.txt")
len(train_df), len(test_df)

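After preprocessing, the Korean-only text is stored in the korean column (the counts below differ from the cell output that follows, presumably because they were recorded with the Hangul-only filter enabled)
- train: 145791
- test: 48995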

In [9]:
preprocessed_train_df = preprocess_dataframe(train_df)
preprocessed_test_df = preprocess_dataframe(test_df)
preprocessed_train_df.shape, preprocessed_test_df.shape

((146182, 3), (49157, 3))

In [10]:
# Convert each DataFrame to a datasets.Dataset
train_data = datasets.Dataset.from_pandas(preprocessed_train_df)
test_data = datasets.Dataset.from_pandas(preprocessed_test_df)

In [11]:
# Use the Okt tokenizer from KoNLPy
kor_tokenizer = get_tokenizer(Okt().morphs)

In [12]:
# Tokenize each review with the Okt tokenizer and truncate it to max_length tokens (no stopword removal is applied here)
def kor_tokenize(review, tokenizer, max_length):
    tokens = tokenizer(review["document"])[:max_length]
    length = len(tokens)
    return {"tokens": tokens, "length": length}

In [13]:
max_length=30
train_data = train_data.map(
    kor_tokenize, fn_kwargs={"tokenizer": kor_tokenizer, "max_length": max_length}
)

Map: 100%|██████████| 146182/146182 [04:05<00:00, 596.28 examples/s]


In [14]:
test_data = test_data.map(
    kor_tokenize, fn_kwargs={"tokenizer":kor_tokenizer, "max_length":max_length}
)

Map: 100%|██████████| 49157/49157 [01:25<00:00, 575.38 examples/s]


In [15]:
# Drop examples that are left with no tokens
def filter_empty_tokens(example):
    return len(example["tokens"]) > 0

train_data = train_data.filter(filter_empty_tokens)
test_data = test_data.filter(filter_empty_tokens)

Filter: 100%|██████████| 146182/146182 [00:03<00:00, 46214.04 examples/s]
Filter: 100%|██████████| 49157/49157 [00:01<00:00, 47446.40 examples/s]


In [16]:
# Split the training data 8:2 into training and validation sets
test_size = 0.2
train_valid_data = train_data.train_test_split(test_size=test_size)
train_data = train_valid_data["train"]
valid_data = train_valid_data["test"]

In [17]:
# Add the <unk> and <pad> special tokens
special_tokens = ["<unk>", "<pad>"]
min_freq = 5

vocab = build_vocab_from_iterator(
    train_data["tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

In [18]:
unk_index = vocab["<unk>"]
pad_index = vocab["<pad>"]

In [19]:
vocab.set_default_index(unk_index)

In [20]:
def numericalize(example, vocab):
    ids = vocab.lookup_indices(example["tokens"])
    return {"ids": ids}
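
To make the vocabulary step concrete, the sketch below reimplements the idea in plain Python: special tokens first, then tokens that occur at least `min_freq` times, with unseen tokens falling back to `<unk>`. It is a stand-in for illustration only; the helper names are hypothetical and the exact token ordering inside torchtext's `build_vocab_from_iterator` may differ.

```python
from collections import Counter


def build_vocab(token_lists, min_freq, specials):
    """Map tokens to indices: specials first, then tokens seen at least min_freq times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials)
    # Sort by descending frequency, then alphabetically, for a deterministic order.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize_tokens(tokens, stoi, unk_token="<unk>"):
    """Look up each token, falling back to the <unk> index for unseen tokens."""
    unk = stoi[unk_token]
    return [stoi.get(tok, unk) for tok in tokens]


toy_corpus = [["영화", "재미", "있다"], ["영화", "별로"], ["영화", "재미", "없다"]]
stoi = build_vocab(toy_corpus, min_freq=2, specials=["<unk>", "<pad>"])
print(stoi)                                               # {'<unk>': 0, '<pad>': 1, '영화': 2, '재미': 3}
print(numericalize_tokens(["영화", "별로", "재미"], stoi))  # [2, 0, 3]
```

Raising `min_freq` shrinks the vocabulary and sends more rare tokens to `<unk>`, which trades coverage for a smaller embedding table.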

In [21]:
# Map each review's tokens to their vocabulary indices
train_data = train_data.map(numericalize, fn_kwargs={"vocab": vocab})
valid_data = valid_data.map(numericalize, fn_kwargs={"vocab": vocab})
test_data = test_data.map(numericalize, fn_kwargs={"vocab": vocab})

Map: 100%|██████████| 116945/116945 [00:17<00:00, 6501.38 examples/s]
Map: 100%|██████████| 29237/29237 [00:04<00:00, 6461.15 examples/s]
Map: 100%|██████████| 49157/49157 [00:07<00:00, 6465.14 examples/s]


In [22]:
# Convert the datasets to torch format.
train_data = train_data.with_format(type="torch", columns=["ids", "label", "length"])
valid_data = valid_data.with_format(type="torch", columns=["ids", "label", "length"])
test_data = test_data.with_format(type="torch", columns=["ids", "label", "length"])

In [23]:
def get_collate_fn(pad_index):
    def collate_fn(batch):
        batch_ids = [i["ids"] for i in batch]
        batch_ids = nn.utils.rnn.pad_sequence(
            batch_ids, padding_value=pad_index, batch_first=True
        )
        batch_length = [i["length"] for i in batch]
        batch_length = torch.stack(batch_length)
        batch_label = [i["label"] for i in batch]
        batch_label = torch.stack(batch_label)
        batch = {"ids": batch_ids, "length": batch_length, "label": batch_label}
        return batch

    return collate_fn
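
The collate function pads every sequence in a batch to the length of the longest one. The plain-Python sketch below shows the same idea on a toy batch; `pad_batch` is a hypothetical helper, not the `nn.utils.rnn.pad_sequence` implementation.

```python
def pad_batch(sequences, pad_index):
    """Right-pad every sequence with pad_index up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]  # original lengths, kept for packing later
    return padded, lengths


batch_ids = [[5, 8, 2], [7], [4, 9]]
padded, lengths = pad_batch(batch_ids, pad_index=1)
print(padded)   # [[5, 8, 2], [7, 1, 1], [4, 9, 1]]
print(lengths)  # [3, 1, 2]
```

Keeping the original lengths alongside the padded ids is what later allows the LSTM to skip the padded positions.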

In [24]:
def get_data_loader(dataset, batch_size, pad_index, shuffle=False):
    collate_fn = get_collate_fn(pad_index)
    data_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        collate_fn=collate_fn,
        shuffle=shuffle,
    )
    return data_loader

In [25]:
# Wrap each dataset in a DataLoader so it can be fed to the model.
batch_size = 512

train_data_loader = get_data_loader(train_data, batch_size, pad_index, shuffle=True)
valid_data_loader = get_data_loader(valid_data, batch_size, pad_index)
test_data_loader = get_data_loader(test_data, batch_size, pad_index)
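
As a sanity check, the split sizes and the number of batches per epoch can be worked out by hand from the dataset sizes shown in the progress bars. The sketch below assumes that the validation split is rounded up and the training split rounded down; that rounding rule is an assumption, although it reproduces the sizes reported in this notebook.

```python
import math

n_train_total = 146_182  # training examples after preprocessing and filtering
n_test = 49_157          # test examples after preprocessing and filtering
test_size = 0.2
batch_size = 512

# Assumption: validation size is rounded up, training size is rounded down.
n_valid = math.ceil(n_train_total * test_size)
n_train = math.floor(n_train_total * (1 - test_size))

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(name, n, "examples ->", math.ceil(n / batch_size), "batches")
```

The batch counts (229, 58, and 97) correspond to the totals shown in the training and evaluation progress bars.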

In [26]:
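# Note: GloVe() loads the default pretrained vectors, which were trained mainly on English text,
# so Korean tokens missing from them receive zero vectors from get_vecs_by_tokens.
vectors = GloVe()
pretrained_embedding = vectors.get_vecs_by_tokens(vocab.get_itos())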

### NBoW (Neural Bag of Words) model

In [27]:
class NBoW(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim, pad_index):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        self.fc = nn.Linear(embedding_dim, output_dim)

    def forward(self, ids):
        # ids = [batch size, seq len]
        embedded = self.embedding(ids)
        # embedded = [batch size, seq len, embedding dim]
        pooled = embedded.mean(dim=1)
        # pooled = [batch size, embedding dim]
        prediction = self.fc(pooled)
        # prediction = [batch size, output dim]
        return prediction

In [28]:
vocab_size = len(vocab)
embedding_dim = 300
output_dim = len(train_data.unique("label"))

NBoW_model = NBoW(vocab_size, embedding_dim, output_dim, pad_index)

In [29]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f"The model has {count_parameters(NBoW_model):,} trainable parameters")

The model has 5,485,502 trainable parameters


In [30]:
NBoW_model.embedding.weight.data = pretrained_embedding

In [31]:
lr = 5e-4

optimizer = optim.Adam(NBoW_model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

In [32]:
# Use CUDA when available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

device

device(type='cuda')

In [33]:
NBoW_model = NBoW_model.to(device)
criterion = criterion.to(device)

In [34]:
def get_accuracy(prediction, label):
    batch_size, _ = prediction.shape
    predicted_classes = prediction.argmax(dim=-1)
    correct_predictions = predicted_classes.eq(label).sum()
    accuracy = correct_predictions / batch_size
    return accuracy
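
`get_accuracy` returns the accuracy of a single batch, and the training and evaluation loops in this notebook average those per-batch values. When the last batch is smaller than the others, that average can differ from the accuracy over all examples. The sketch below illustrates the difference with made-up counts; both helper names are hypothetical.

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracies, giving every batch the same weight."""
    return sum(correct / size for correct, size in batches) / len(batches)


def example_weighted_accuracy(batches):
    """Total correct predictions divided by total examples."""
    return sum(c for c, _ in batches) / sum(s for _, s in batches)


# Made-up (correct, batch_size) pairs; the last batch is much smaller than the rest.
batches = [(450, 512), (440, 512), (10, 20)]
print(round(mean_of_batch_accuracies(batches), 4))   # 0.7461
print(round(example_weighted_accuracy(batches), 4))  # 0.8621
```

With batches of 512 and tens of thousands of examples the gap is usually small, but the example-weighted figure is the one that matches the usual definition of accuracy.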

In [35]:
def train_NBoW(data_loader, model, criterion, optimizer, device):
    model.train()
    epoch_losses = []
    epoch_accs = []
    for batch in tqdm.tqdm(data_loader, desc="training..."):
        ids = batch["ids"].to(device)
        label = batch["label"].to(device)
        prediction = model(ids)
        loss = criterion(prediction, label)
        accuracy = get_accuracy(prediction, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())
        epoch_accs.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accs)

In [36]:
def evaluate_NBoW(data_loader, model, criterion, device):
    model.eval()
    epoch_losses = []
    epoch_accs = []
    with torch.no_grad():
        for batch in tqdm.tqdm(data_loader, desc="evaluating..."):
            ids = batch["ids"].to(device)
            label = batch["label"].to(device)
            prediction = model(ids)
            loss = criterion(prediction, label)
            accuracy = get_accuracy(prediction, label)
            epoch_losses.append(loss.item())
            epoch_accs.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accs)

In [37]:
n_epochs = 10
best_valid_loss = float("inf")

metrics = collections.defaultdict(list)

for epoch in range(n_epochs):
    train_loss, train_acc = train_NBoW(
        train_data_loader, NBoW_model, criterion, optimizer, device
    )
    valid_loss, valid_acc = evaluate_NBoW(valid_data_loader, NBoW_model, criterion, device)
    metrics["train_losses"].append(train_loss)
    metrics["train_accs"].append(train_acc)
    metrics["valid_losses"].append(valid_loss)
    metrics["valid_accs"].append(valid_acc)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(NBoW_model.state_dict(), "base_nbow.pt")
    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")

training...: 100%|██████████| 229/229 [00:09<00:00, 24.48it/s]
evaluating...: 100%|██████████| 58/58 [00:01<00:00, 30.36it/s]


epoch: 0
train_loss: 0.614, train_acc: 0.740
valid_loss: 0.503, valid_acc: 0.798


training...: 100%|██████████| 229/229 [00:09<00:00, 24.20it/s]
evaluating...: 100%|██████████| 58/58 [00:01<00:00, 35.47it/s]


epoch: 1
train_loss: 0.435, train_acc: 0.826
valid_loss: 0.406, valid_acc: 0.831


training...: 100%|██████████| 229/229 [00:09<00:00, 24.02it/s]
evaluating...: 100%|██████████| 58/58 [00:01<00:00, 34.67it/s]


epoch: 2
train_loss: 0.367, train_acc: 0.851
valid_loss: 0.374, valid_acc: 0.841


training...: 100%|██████████| 229/229 [00:09<00:00, 24.04it/s]
evaluating...: 100%|██████████| 58/58 [00:01<00:00, 34.75it/s]


epoch: 3
train_loss: 0.335, train_acc: 0.863
valid_loss: 0.360, valid_acc: 0.848


training...: 100%|██████████| 229/229 [00:09<00:00, 24.46it/s]
evaluating...: 100%|██████████| 58/58 [00:01<00:00, 32.30it/s]


epoch: 4
train_loss: 0.315, train_acc: 0.872
valid_loss: 0.355, valid_acc: 0.851


training...: 100%|██████████| 229/229 [00:10<00:00, 22.47it/s]
evaluating...: 100%|██████████| 58/58 [00:01<00:00, 29.36it/s]


epoch: 5
train_loss: 0.301, train_acc: 0.878
valid_loss: 0.354, valid_acc: 0.851


training...: 100%|██████████| 229/229 [00:09<00:00, 24.81it/s]
evaluating...: 100%|██████████| 58/58 [00:01<00:00, 34.20it/s]


epoch: 6
train_loss: 0.290, train_acc: 0.882
valid_loss: 0.355, valid_acc: 0.851


training...: 100%|██████████| 229/229 [00:09<00:00, 24.58it/s]
evaluating...: 100%|██████████| 58/58 [00:01<00:00, 32.51it/s]


epoch: 7
train_loss: 0.281, train_acc: 0.886
valid_loss: 0.358, valid_acc: 0.850


training...: 100%|██████████| 229/229 [00:09<00:00, 24.57it/s]
evaluating...: 100%|██████████| 58/58 [00:01<00:00, 34.39it/s]


epoch: 8
train_loss: 0.275, train_acc: 0.889
valid_loss: 0.362, valid_acc: 0.848


training...: 100%|██████████| 229/229 [00:09<00:00, 25.06it/s]
evaluating...: 100%|██████████| 58/58 [00:01<00:00, 35.44it/s]

epoch: 9
train_loss: 0.269, train_acc: 0.891
valid_loss: 0.366, valid_acc: 0.847





In [38]:
_, test_acc = evaluate_NBoW(test_data_loader, NBoW_model, criterion, device)
print(f"test_acc: {test_acc:.3f}")

evaluating...: 100%|██████████| 97/97 [00:03<00:00, 31.84it/s]

test_acc: 0.842





### LSTM (RNN) model

In [39]:
class LSTM(nn.Module):
    def __init__(
        self,
        vocab_size,
        embedding_dim,
        hidden_dim,
        output_dim,
        n_layers,
        bidirectional,
        dropout_rate,
        pad_index,
    ):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            n_layers,
            bidirectional=bidirectional,
            dropout=dropout_rate,
            batch_first=True,
        )
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, ids, length):
        # ids = [batch size, seq len]
        # length = [batch size]
        embedded = self.dropout(self.embedding(ids))
        # embedded = [batch size, seq len, embedding dim]
        packed_embedded = nn.utils.rnn.pack_padded_sequence(
            embedded, length, batch_first=True, enforce_sorted=False
        )
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        # hidden = [n layers * n directions, batch size, hidden dim]
        # cell = [n layers * n directions, batch size, hidden dim]
        output, output_length = nn.utils.rnn.pad_packed_sequence(
            packed_output, batch_first=True
        )
        # output = [batch size, seq len, hidden dim * n directions]
        if self.lstm.bidirectional:
            hidden = self.dropout(torch.cat([hidden[-1], hidden[-2]], dim=-1))
            # hidden = [batch size, hidden dim * 2]
        else:
            hidden = self.dropout(hidden[-1])
            # hidden = [batch size, hidden dim]
        prediction = self.fc(hidden)
        # prediction = [batch size, output dim]
        return prediction

In [40]:
vocab_size = len(vocab)
embedding_dim = 300
hidden_dim = 128
output_dim = len(train_data.unique("label"))
n_layers = 4
bidirectional = True
dropout_rate = 0.2

lstm_model = LSTM(
    vocab_size,
    embedding_dim,
    hidden_dim,
    output_dim,
    n_layers,
    bidirectional,
    dropout_rate,
    pad_index,
)

In [41]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f"The model has {count_parameters(lstm_model):,} trainable parameters")

The model has 7,111,526 trainable parameters


In [42]:
def initialize_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight)
        nn.init.zeros_(m.bias)
    elif isinstance(m, nn.LSTM):
        for name, param in m.named_parameters():
            if "bias" in name:
                nn.init.zeros_(param)
            elif "weight" in name:
                nn.init.orthogonal_(param)

In [43]:
lstm_model.apply(initialize_weights)

LSTM(
  (embedding): Embedding(18283, 300, padding_idx=1)
  (lstm): LSTM(300, 128, num_layers=4, batch_first=True, dropout=0.2, bidirectional=True)
  (fc): Linear(in_features=256, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
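
The parameter counts reported above (5,485,502 for NBoW and 7,111,526 for the LSTM) can be reproduced by hand from the layer shapes in this printout. The sketch below assumes the usual LSTM layout of four gates per direction, each with input weights, recurrent weights, and two bias vectors; `lstm_param_count` is a hypothetical helper.

```python
def lstm_param_count(input_dim, hidden_dim, n_layers, bidirectional):
    """Count weights and biases of a stacked LSTM with two bias vectors per gate set."""
    n_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers after the first receive the concatenated outputs of both directions.
        layer_input = input_dim if layer == 0 else hidden_dim * n_directions
        # 4 gates: input weights + recurrent weights, plus two bias vectors.
        per_direction = 4 * hidden_dim * (layer_input + hidden_dim) + 2 * 4 * hidden_dim
        total += per_direction * n_directions
    return total


vocab_size, embedding_dim, hidden_dim, output_dim = 18283, 300, 128, 2

embedding = vocab_size * embedding_dim
nbow_total = embedding + (embedding_dim * output_dim + output_dim)
lstm_total = (
    embedding
    + lstm_param_count(embedding_dim, hidden_dim, n_layers=4, bidirectional=True)
    + (2 * hidden_dim * output_dim + output_dim)
)
print(f"{nbow_total:,}")  # 5,485,502
print(f"{lstm_total:,}")  # 7,111,526
```

In both models the embedding table (5,484,900 parameters) accounts for most of the total, so the vocabulary size matters more to model size than the LSTM depth.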

In [44]:
lstm_model.embedding.weight.data = pretrained_embedding

In [45]:
lr = 5e-4

optimizer = optim.Adam(lstm_model.parameters(), lr=lr)

In [46]:
criterion = nn.CrossEntropyLoss()

In [47]:
# Use CUDA when available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

device

device(type='cuda')

In [48]:
lstm_model = lstm_model.to(device)
criterion = criterion.to(device)

In [49]:
def get_accuracy(prediction, label):
    batch_size, _ = prediction.shape
    predicted_classes = prediction.argmax(dim=-1)
    correct_predictions = predicted_classes.eq(label).sum()
    accuracy = correct_predictions / batch_size
    return accuracy

In [50]:
def train(dataloader, model, criterion, optimizer, device):
    model.train()
    epoch_losses = []
    epoch_accs = []
    for batch in tqdm.tqdm(dataloader, desc="training..."):
        ids = batch["ids"].to(device)
        length = batch["length"]
        label = batch["label"].to(device)
        prediction = model(ids, length)
        loss = criterion(prediction, label)
        accuracy = get_accuracy(prediction, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())
        epoch_accs.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accs)

In [51]:
def evaluate(dataloader, model, criterion, device):
    model.eval()
    epoch_losses = []
    epoch_accs = []
    with torch.no_grad():
        for batch in tqdm.tqdm(dataloader, desc="evaluating..."):
            ids = batch["ids"].to(device)
            length = batch["length"]
            label = batch["label"].to(device)
            prediction = model(ids, length)
            loss = criterion(prediction, label)
            accuracy = get_accuracy(prediction, label)
            epoch_losses.append(loss.item())
            epoch_accs.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accs)

### Model training

In [52]:
n_epochs = 10
best_valid_loss = float("inf")

metrics = collections.defaultdict(list)

for epoch in range(n_epochs):
    train_loss, train_acc = train(
        train_data_loader, lstm_model, criterion, optimizer, device
    )
    valid_loss, valid_acc = evaluate(valid_data_loader, lstm_model, criterion, device)
    metrics["train_losses"].append(train_loss)
    metrics["train_accs"].append(train_acc)
    metrics["valid_losses"].append(valid_loss)
    metrics["valid_accs"].append(valid_acc)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(lstm_model.state_dict(), "base_lstm.pt")
    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")

training...: 100%|██████████| 229/229 [00:25<00:00,  9.16it/s]
evaluating...: 100%|██████████| 58/58 [00:02<00:00, 20.49it/s]


epoch: 0
train_loss: 0.430, train_acc: 0.796
valid_loss: 0.348, valid_acc: 0.848


training...: 100%|██████████| 229/229 [00:24<00:00,  9.51it/s]
evaluating...: 100%|██████████| 58/58 [00:02<00:00, 20.44it/s]


epoch: 1
train_loss: 0.310, train_acc: 0.868
valid_loss: 0.349, valid_acc: 0.849


training...: 100%|██████████| 229/229 [00:23<00:00,  9.78it/s]
evaluating...: 100%|██████████| 58/58 [00:02<00:00, 20.54it/s]


epoch: 2
train_loss: 0.282, train_acc: 0.882
valid_loss: 0.350, valid_acc: 0.848


training...: 100%|██████████| 229/229 [00:23<00:00,  9.60it/s]
evaluating...: 100%|██████████| 58/58 [00:02<00:00, 21.09it/s]


epoch: 3
train_loss: 0.258, train_acc: 0.894
valid_loss: 0.361, valid_acc: 0.846


training...: 100%|██████████| 229/229 [00:24<00:00,  9.41it/s]
evaluating...: 100%|██████████| 58/58 [00:03<00:00, 18.42it/s]


epoch: 4
train_loss: 0.228, train_acc: 0.908
valid_loss: 0.386, valid_acc: 0.842


training...: 100%|██████████| 229/229 [00:24<00:00,  9.36it/s]
evaluating...: 100%|██████████| 58/58 [00:02<00:00, 19.82it/s]


epoch: 5
train_loss: 0.195, train_acc: 0.921
valid_loss: 0.409, valid_acc: 0.839


training...: 100%|██████████| 229/229 [00:24<00:00,  9.22it/s]
evaluating...: 100%|██████████| 58/58 [00:02<00:00, 19.81it/s]


epoch: 6
train_loss: 0.163, train_acc: 0.934
valid_loss: 0.485, valid_acc: 0.834


training...: 100%|██████████| 229/229 [00:24<00:00,  9.16it/s]
evaluating...: 100%|██████████| 58/58 [00:03<00:00, 18.32it/s]


epoch: 7
train_loss: 0.137, train_acc: 0.945
valid_loss: 0.520, valid_acc: 0.831


training...: 100%|██████████| 229/229 [00:24<00:00,  9.28it/s]
evaluating...: 100%|██████████| 58/58 [00:02<00:00, 19.65it/s]


epoch: 8
train_loss: 0.114, train_acc: 0.954
valid_loss: 0.600, valid_acc: 0.827


training...: 100%|██████████| 229/229 [00:24<00:00,  9.44it/s]
evaluating...: 100%|██████████| 58/58 [00:02<00:00, 20.14it/s]

epoch: 9
train_loss: 0.097, train_acc: 0.961
valid_loss: 0.688, valid_acc: 0.827





### Model evaluation

In [53]:
_, test_acc = evaluate(test_data_loader, lstm_model, criterion, device)
print(f"test_acc: {test_acc:.3f}")

evaluating...: 100%|██████████| 97/97 [00:04<00:00, 19.90it/s]

test_acc: 0.820
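
Both training loops save a checkpoint only when the validation loss improves, but the test cells evaluate the model as it stands after the final epoch. The sketch below replays the validation losses printed in the logs above to show which epoch each saved checkpoint corresponds to; `best_epoch` is a hypothetical helper.

```python
# Validation losses copied from the training logs above (rounded to three decimals).
nbow_valid_losses = [0.503, 0.406, 0.374, 0.360, 0.355, 0.354, 0.355, 0.358, 0.362, 0.366]
lstm_valid_losses = [0.348, 0.349, 0.350, 0.361, 0.386, 0.409, 0.485, 0.520, 0.600, 0.688]


def best_epoch(valid_losses):
    """Return the last epoch at which a strict `loss < best_loss` check would save."""
    best, best_loss = None, float("inf")
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            best, best_loss = epoch, loss
    return best


print(best_epoch(nbow_valid_losses))  # 5
print(best_epoch(lstm_valid_losses))  # 0
```

The LSTM's validation loss rises from the first epoch onward while its training loss keeps falling, which points to overfitting. Reloading the saved weights before testing, for example with `lstm_model.load_state_dict(torch.load("base_lstm.pt"))`, would evaluate the checkpoint with the lowest validation loss instead of the final-epoch model; the test accuracies reported in this notebook were not obtained that way.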





## User Input


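The preprocessing mirrors the steps above
- Keep only Korean characters
- Stopword removal (commented out in the code)
- Map tokens through vocab, then convert to a tensor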

In [57]:
def predict_sentiment(model, sentence):
    # Keep only Hangul characters and spaces
    sentence = re.sub(pattern=r"[^ㄱ-ㅎㅏ-ㅣ가-힣 ]", repl=r"", string=sentence)
    tokens = kor_tokenizer(sentence)
    # tokens = [token for token in tokens if token not in stop_words]
    ids = vocab.lookup_indices(tokens)
    tensor = torch.LongTensor(ids).unsqueeze(dim=0).to(device)
    model.eval()  # disable dropout for inference
    with torch.no_grad():
        if isinstance(model, NBoW):
            prediction = model(tensor).squeeze(dim=0)
        elif isinstance(model, LSTM):
            # pack_padded_sequence expects the lengths tensor on the CPU
            length = torch.LongTensor([len(ids)])
            prediction = model(tensor, length).squeeze(dim=0)
        else:
            raise TypeError(f"unsupported model type: {type(model).__name__}")
    probability = torch.softmax(prediction, dim=-1)
    predicted_class = prediction.argmax(dim=-1).item()
    predicted_probability = probability[predicted_class].item()
    return predicted_class, predicted_probability
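
The last three lines of `predict_sentiment` turn the two output scores into a class and a confidence value. The plain-Python sketch below walks through that step with made-up logits; the `softmax` helper here is an illustration and not the `torch.softmax` implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-0.4, 1.2]  # made-up scores for class 0 (negative) and class 1 (positive)
probs = softmax(logits)
predicted_class = max(range(len(logits)), key=logits.__getitem__)
print(predicted_class, round(probs[predicted_class], 3))
```

The predicted class is the index of the largest logit, and the reported probability is the softmax value at that index, so it is always at least 0.5 in the two-class case.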

In [58]:
# Answers: 1 / 1 / 1 / 1 / 1 / 0 / 0 / 0 / 0 / 0

s1 = "존잼이다 오컬트 영환데 ㄹㅇ힙함 파묘 보고 김고은과 이도현을 사랑하게 됨ㅠㅠ"
s2 = "배우들 연기 미쳤다.. 몰입하다보니 영화 끝나있음.. 내 기준 올해 한국 영화 중에 탑ㅠ"
s3 = "이런 기이한 이야기에 미술, 의상, 음악이 이렇게 예뻐도 되는 거냐. 스톤과 러팔로도 아카데미 주조연상 후보에 나란히 올라갈 정도의 코미디 연기를 보여줘서 보는 내내 입꼬리가 안 내려갔다"
s4 = "네게 미결로 남고 싶은 내 삶의 흔적"
s5 = "내가 얼마나 '정상적'이고 괜찮은 인간인지 스스로에게, 그리고 타인에게 인정받을 수 있는 가장 쉽고 빠른 방법은 다른 누군가를 괴물이라고 손가락질 해보이는 것이다. 이런식의 괴물 색출, 사냥 놀이에 몰두하는 사회일수록 구성원들의 정신, 사고, 행동은 영화에서처럼 병들고 뒤틀릴 수밖에 없다"
s6 = "명성에 비해 그닥"
s7 = "재밌네요 물론 한번도 웃지는 않았습니다"
s8 = "엥 평론가는 0점 줄수있어요??? 나 cgv vvip 메가박스 mvip 인데 수백편 영화보면서 평론가가 0점 준거 처음봤는데 신기하내요 버그나 오류아니겠죠??평론가 0점 실화에요?? 나는 그래도 솔직히 10점만점에 1.8점 정도인데"
s9 = "맥아리없는 심심한 빌런, 남은건 초롱이뿐"
s10 = "1. 영화 내내 유치하고 재미없는 유머 남발초롱이 캐릭터는 좋았다2. 말이 안 되는 장면 천지초반부터 형사가 일반인 구타?줄빠따 맞고도 아무일 없듯이 칼듯 야쿠자들 관광전편들과 달리 형사로서의 작전 능력 없이 모든 과정이 우연처럼 성공적으로 풀린다3. 약한 빌런임팩트 있게 등장하지만 실상 까보면 아무것도 없음4. 마동석 원맨쇼동료형사들 아예 안 나왔어도 무방할듯광수대로 가면서 동료를 다 버렸다"


- NBoW and LSTM performance

In [59]:
reviews = {"review":[s1, s2, s3, s4, s5, s6, s7, s8, s9, s10],
           "answer":[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]}
NBoW_score, lstm_score = [], []

for index, review in enumerate(reviews['review']):
    NBoW_pred_class, NBoW_pred_prob = predict_sentiment(NBoW_model, reviews['review'][index])
    lstm_pred_class, lstm_pred_prob = predict_sentiment(lstm_model, reviews['review'][index])
    if NBoW_pred_class==reviews['answer'][index]:
        
        print("NBoW: "+"%10s"%f"correct: \tpred({NBoW_pred_class}), answer({reviews['answer'][index]})")
        NBoW_score.append(1)
    else:
        print("NBoW: "+"%10s"%f"wrong: \tpred({NBoW_pred_class}), answer({reviews['answer'][index]})")
        NBoW_score.append(0)
    if lstm_pred_class==reviews['answer'][index]:
        print("LSTM: "+"%10s"%f"correct: \tpred({lstm_pred_class}), answer({reviews['answer'][index]})")
        lstm_score.append(1)
    else:
        print("LSTM: "+"%10s"%f"wrong: \tpred({lstm_pred_class}), answer({reviews['answer'][index]})")
        lstm_score.append(0)
print(f"NBoW model: {np.mean(NBoW_score)*100:.0f} points")
print(f"LSTM model: {np.mean(lstm_score)*100:.0f} points")

NBoW: correct: 	pred(1), answer(1)
LSTM: correct: 	pred(1), answer(1)
NBoW: correct: 	pred(1), answer(1)
LSTM: correct: 	pred(1), answer(1)
NBoW: wrong: 	pred(0), answer(1)
LSTM: correct: 	pred(1), answer(1)
NBoW: correct: 	pred(1), answer(1)
LSTM: correct: 	pred(1), answer(1)
NBoW: correct: 	pred(1), answer(1)
LSTM: correct: 	pred(1), answer(1)
NBoW: correct: 	pred(0), answer(0)
LSTM: correct: 	pred(0), answer(0)
NBoW: wrong: 	pred(1), answer(0)
LSTM: wrong: 	pred(1), answer(0)
NBoW: wrong: 	pred(1), answer(0)
LSTM: wrong: 	pred(1), answer(0)
NBoW: correct: 	pred(0), answer(0)
LSTM: correct: 	pred(0), answer(0)
NBoW: correct: 	pred(0), answer(0)
LSTM: correct: 	pred(0), answer(0)
NBoW model: 70 points
LSTM model: 80 points
