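https://moondol-ai.tistory.com/463 (how BERT works)
Task: decide whether a sentence is positive or negative (sentiment classification)  
This is a Single Text Classification task: one label for one piece of text  
Without BERT: data to classify -> machine learning model such as an LSTM or CNN -> classification  
With BERT: corpus (data used to train a natural language processing model) -> BERT -> data to classify -> machine learning model such as an LSTM or CNN -> classification  
The good embeddings that BERT learns from the corpus are fed into the downstream model (transfer learning)  
BERT is pretrained on a corpus of about 3.3 billion words; it creates its own labels from the raw text (self-supervised pretraining), and pretraining followed by supervised fine-tuning is often described as semi-supervised learning  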
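The remark that BERT creates its own labels can be illustrated with a toy masked-language-modeling step. This is a minimal sketch, not BERT's actual preprocessing: the function name `make_mlm_example`, the seed, and the sample sentence are made up for illustration, and the real procedure also sometimes keeps or randomly replaces the selected tokens instead of always masking them.

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Hide some tokens and keep the originals as labels (toy masked-LM step)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")   # the model sees the mask...
            labels.append(tok)        # ...and must predict the hidden token
        else:
            inputs.append(tok)
            labels.append(None)       # no loss is computed for unmasked positions
    return inputs, labels

# Hypothetical sentence; a higher mask_prob just makes the effect visible
tokens = "this film is pretty much the first one I have seen".split()
inputs, labels = make_mlm_example(tokens, mask_prob=0.3, seed=1)
print(inputs)
print(labels)
```

No human annotation is involved: the labels are simply the tokens that were hidden.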

In [1]:
import torch

from transformers import BertTokenizer
from transformers import BertForSequenceClassification, AdamW, BertConfig
from transformers import get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from torchnlp.datasets import imdb_dataset

import pandas as pd
import numpy as np
import random as rn
import time
import datetime

In [2]:
import os

n_devices = torch.cuda.device_count()
print(n_devices)

for i in range(n_devices):
    print(torch.cuda.get_device_name(i))

1
NVIDIA GeForce RTX 2070


In [3]:
# Use the IMDB dataset; the samples come in order, so shuffle them
train, test = imdb_dataset(train=True, test=True)
rn.shuffle(train)
rn.shuffle(test)

train = train[:2000]
test = test[:200]

train = pd.DataFrame(train)
test = pd.DataFrame(test)

# Map the labels to 0 and 1
change = {'neg' : 0, 'pos' : 1}
train = train.replace({'sentiment' : change})
test = test.replace({'sentiment' : change})

print(train.shape)
print(test.shape)

(2000, 2)
(200, 2)


In [4]:
# Mark the start of each text with [CLS] and the end with [SEP]
document_bert = ["[CLS] " + str(s) + " [SEP]" for s in train.text]
document_bert[:5]

["[CLS] This is pretty much the first Jason Scott Lee film I've seen. I say pretty much, because I have also seen Soldier, in which he plays the villain... but from what I've heard, it's not considered a Jason Scott Lee film. This, however, is. And if this is any indication of the quality of such films, I won't be seeing any of the others. Lee is basically passable as a martial arts artist... as the lead, he's awful. He gets in a fight with random no-name characters every few minutes of the film, probably because the script writer couldn't figure out how else to stretch out the film to the minimum required running time for a feature film. The villain is the only character with even a hint of personality, and aside from the fact that he's certifiably insane, he barely seems like a villain at all. The majority of the film is basically Lee chasing the villain through time... or maybe it's the other way around. I can't say for sure... and I definitely wouldn't watch it again to make sure. 

In [5]:
# Tokenize each text with the tokenizer of the pretrained BERT model
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
tokenized_texts = [tokenizer.tokenize(s) for s in document_bert]
print(tokenized_texts[0])

['[CLS]', 'This', 'is', 'pretty', 'much', 'the', 'first', 'Jason', 'Scott', 'Lee', 'film', 'I', "'", 've', 'seen', '.', 'I', 'say', 'pretty', 'much', ',', 'because', 'I', 'have', 'also', 'seen', 'Soldier', ',', 'in', 'which', 'he', 'plays', 'the', 'villa', '##in', '.', '.', '.', 'but', 'from', 'what', 'I', "'", 've', 'heard', ',', 'it', "'", 's', 'not', 'considered', 'a', 'Jason', 'Scott', 'Lee', 'film', '.', 'This', ',', 'however', ',', 'is', '.', 'And', 'if', 'this', 'is', 'any', 'indication', 'of', 'the', 'quality', 'of', 'such', 'films', ',', 'I', 'won', "'", 't', 'be', 'seeing', 'any', 'of', 'the', 'others', '.', 'Lee', 'is', 'basic', '##ally', 'passa', '##ble', 'as', 'a', 'martial', 'arts', 'artist', '.', '.', '.', 'as', 'the', 'lead', ',', 'he', "'", 's', 'aw', '##ful', '.', 'He', 'gets', 'in', 'a', 'fight', 'with', 'random', 'no', '-', 'name', 'characters', 'every', 'few', 'minutes', 'of', 'the', 'film', ',', 'probably', 'because', 'the', 'script', 'writer', 'couldn', "'", 't',
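The `##` pieces in the output ('villa', '##in') come from WordPiece splitting. The following is a minimal sketch of the greedy longest-match-first idea, assuming a tiny hand-made vocabulary; the function name `wordpiece` and `toy_vocab` are hypothetical, and the real tokenizer also handles casing, punctuation, and other details not shown here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (hypothetical; the real vocabulary is far larger)
toy_vocab = {"villa", "##in", "basic", "##ally", "aw", "##ful", "film"}
print(wordpiece("villain", toy_vocab))    # ['villa', '##in']
print(wordpiece("basically", toy_vocab))  # ['basic', '##ally']
print(wordpiece("film", toy_vocab))       # ['film']
```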

In [6]:
# MAX_LEN should be at least as long as the longest sequence; BERT's limit is 512 tokens
MAX_LEN = 512
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype='long', truncating='post', padding='post')
input_ids[0]

array([   101,  10747,  10124, 108361,  13172,  10105,  10422,  16796,
        12812,  12006,  10458,    146,    112,  10323,  15652,    119,
          146,  23763, 108361,  13172,    117,  12373,    146,  10529,
        10379,  15652,  50162,    117,  10106,  10319,  10261,  17724,
        10105,  19863,  10245,    119,    119,    119,  10473,  10188,
        12976,    146,    112,  10323,  32240,    117,  10271,    112,
          187,  10472,  14289,    169,  16796,  12812,  12006,  10458,
          119,  10747,    117,  13800,    117,  10124,    119,  12689,
        12277,  10531,  10124,  11178, 102383,  10108,  10105,  21905,
        10108,  11049,  14280,    117,    146,  11367,    112,    188,
        10347,  57039,  11178,  10108,  10105,  14633,    119,  12006,
        10124,  25090,  19777,  21323,  11203,  10146,    169,  65187,
        17045,  16410,    119,    119,    119,  10146,  10105,  14107,
          117,  10261,    112,    187,  56237,  14446,    119,  10357,
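What `pad_sequences(..., truncating='post', padding='post')` does to each sequence can be sketched in plain Python. The helper names below are made up for illustration. The second helper shows one possible variant that keeps the closing `[SEP]` when a long review is cut; it assumes 102 is the `[SEP]` id, as in the standard BERT vocabularies, and is not what the notebook itself does.

```python
def pad_post(seq, maxlen, pad_id=0):
    """Cut a sequence at maxlen and fill the rest with pad_id (both at the end)."""
    seq = list(seq)[:maxlen]                      # truncating='post'
    return seq + [pad_id] * (maxlen - len(seq))   # padding='post'

def pad_post_keep_sep(seq, maxlen, sep_id=102, pad_id=0):
    """Variant that keeps the final [SEP] id when a sequence has to be cut."""
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[:maxlen - 1] + [sep_id]
    return seq + [pad_id] * (maxlen - len(seq))

ids = [101, 10747, 10124, 108361, 13172, 102]  # [CLS] ... [SEP]
print(pad_post(ids, 8))           # [101, 10747, 10124, 108361, 13172, 102, 0, 0]
print(pad_post(ids, 4))           # [101, 10747, 10124, 108361]
print(pad_post_keep_sep(ids, 4))  # [101, 10747, 10124, 102]
```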
      

In [7]:
# Attention mask: 1 where there is a real token, 0 where there is padding,
# so the model does not attend to the padded positions
attention_masks = []
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)
print(attention_masks[0])

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
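The same mask can be built without a Python loop. A minimal sketch, assuming the padding id is 0 (as in the padding step above); the array `input_ids_demo` is a made-up stand-in for the real `input_ids`.

```python
import numpy as np

# Two short, already padded rows (hypothetical ids)
input_ids_demo = np.array([[101, 10747, 10124, 102, 0, 0],
                           [101, 146, 102, 0, 0, 0]])

# 1.0 for real tokens, 0.0 for padding, computed for the whole array at once
mask_demo = (input_ids_demo != 0).astype(np.float32)
print(mask_demo)
print(mask_demo.sum(axis=1))  # number of real tokens per row
```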

In [8]:
# Split the training data into a training set and a validation set
# The validation set is used to check that the model is not overfitting
train_inputs, validation_inputs, train_labels, validation_labels = \
train_test_split(input_ids, train['sentiment'].values, random_state=42, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, 
                                                       input_ids,
                                                       random_state=42, 
                                                       test_size=0.1)
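The two `train_test_split` calls stay aligned only because they share `random_state=42` and the same number of rows. The underlying idea, one shuffled index applied to every array, can be sketched in plain Python; `split_together` is a hypothetical helper, not a scikit-learn function (scikit-learn's `train_test_split` itself also accepts several arrays in a single call).

```python
import random

def split_together(*arrays, test_size=0.1, seed=42):
    """Split several equally long sequences with one shared shuffled index."""
    n = len(arrays[0])
    assert all(len(a) == n for a in arrays), "all inputs must have the same length"
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = int(round(n * test_size))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    out = []
    for a in arrays:
        out.append([a[i] for i in train_idx])
        out.append([a[i] for i in val_idx])
    return out

# Made-up data: the middle id encodes the row number, the label is its parity
ids = [[101, i, 102] for i in range(20)]
masks = [[1.0, 1.0, 1.0] for _ in range(20)]
labels = [i % 2 for i in range(20)]
tr_ids, va_ids, tr_masks, va_masks, tr_labels, va_labels = split_together(ids, masks, labels)
print(len(tr_ids), len(va_ids))  # 18 2
```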

In [9]:
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)
validation_inputs = torch.tensor(validation_inputs)
validation_labels = torch.tensor(validation_labels)
validation_masks = torch.tensor(validation_masks)

In [10]:
BATCH_SIZE = 4

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=BATCH_SIZE)

In [11]:
# Preprocess the test data the same way
sentences = test['text']
sentences = ["[CLS] " + str(sentence) + " [SEP]" for sentence in sentences]
labels = test['sentiment'].values

tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

attention_masks = []
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)
    
test_inputs = torch.tensor(input_ids)
test_labels = torch.tensor(labels)
test_masks = torch.tensor(attention_masks)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = RandomSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)


In [12]:
if torch.cuda.is_available():    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print('No GPU available, using the CPU instead.')

There are 1 GPU(s) available.
We will use the GPU: NVIDIA GeForce RTX 2070


In [13]:
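# Create the BERT model; this is binary classification, so num_labels = 2
model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=2)
# Move the model to the selected device; as the last expression in the cell,
# this also displays the full model architecture
model.to(device)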

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elemen

In [14]:
# Set up the optimizer
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # learning rate
                  eps = 1e-8 # epsilon that prevents division by zero
                )

# Number of epochs
epochs = 4

# Total number of training steps
total_steps = len(train_dataloader) * epochs

# Scheduler that gradually lowers the learning rate
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)
# Any warning printed here can be ignored
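To see what a linear schedule with warmup does to the learning rate, the multiplier can be written out by hand. This is a sketch of the usual formula (linear ramp-up during warmup, then linear decay to zero), not the library's source code; `linear_schedule_lr` is a hypothetical name. The step count assumes the 1,800 training reviews left after the 90/10 split and a batch size of 4.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# 1,800 training reviews / batch size 4 = 450 steps per epoch, times 4 epochs
total = 450 * 4
for step in (0, 450, 900, 1350, 1800):
    print(step, linear_schedule_lr(step, 2e-5, 0, total))
```

With `num_warmup_steps = 0` the rate starts at its full value and falls in a straight line, reaching half of 2e-5 after two epochs and zero at the last step.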



In [15]:
# Accuracy function
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Time formatting function
def format_time(elapsed):
    # Round to the nearest second
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
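A small worked example of both helpers, using made-up logits for four reviews. The arrays are illustrative only; the logic mirrors `flat_accuracy` and `format_time`.

```python
import datetime
import numpy as np

logits = np.array([[ 2.1, -1.0],   # predicts class 0
                   [-0.3,  0.8],   # predicts class 1
                   [ 0.2,  0.1],   # predicts class 0
                   [-1.5,  1.5]])  # predicts class 1
labels = np.array([0, 1, 1, 1])

pred = np.argmax(logits, axis=1)          # index of the larger logit in each row
accuracy = np.sum(pred == labels) / len(labels)
print(pred, accuracy)                     # [0 1 0 1] 0.75

# 134.6 seconds rounds to 135 seconds, i.e. 2 minutes 15 seconds
print(str(datetime.timedelta(seconds=int(round(134.6)))))  # 0:02:15
```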

In [16]:
for step, batch in enumerate(train_dataloader):
    b_input_ids, b_input_mask, b_labels = batch
    print(b_input_ids, b_input_mask, b_labels)
    break

tensor([[  101, 14535, 10531,  ...,     0,     0,     0],
        [  101, 10747, 18379,  ...,     0,     0,     0],
        [  101, 11301,   107,  ...,     0,     0,     0],
        [  101, 10772, 40751,  ...,     0,     0,     0]], dtype=torch.int32) tensor([[1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.]]) tensor([0, 0, 1, 0])


In [17]:
# Fix the random seeds for reproducibility
seed_val = 42
rn.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Reset the gradients
model.zero_grad()

# Repeat for the number of epochs
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Record the start time
    t0 = time.time()

    # Reset the loss
    total_loss = 0

    # Switch to training mode
    model.train()
        
    # Iterate over the batches from the data loader
    for step, batch in enumerate(train_dataloader):
        # Show progress
        if step % 500 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Move the batch to the device (the GPU if available)
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the batch
        b_input_ids, b_input_mask, b_labels = batch

        # Forward pass
        outputs = model(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask, 
                        labels=b_labels)
        
        # Get the loss
        loss = outputs[0]

        # Accumulate the total loss
        total_loss += loss.item()

        # Backward pass to compute the gradients
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update the weights using the gradients
        optimizer.step()

        # Lower the learning rate with the scheduler
        scheduler.step()

        # Reset the gradients
        model.zero_grad()

    # Compute the average loss
    avg_train_loss = total_loss / len(train_dataloader)            

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    # Record the start time
    t0 = time.time()

    # Switch to evaluation mode
    model.eval()

    # Initialize the variables
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Iterate over the batches from the data loader
    for batch in validation_dataloader:
        # Move the batch to the device (the GPU if available)
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the batch
        b_input_ids, b_input_mask, b_labels = batch
        
        # No gradient computation
        with torch.no_grad():     
            # Forward pass
            outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask)
        
        # Get the logits (no labels were passed, so the first output is the logits)
        logits = outputs[0]

        # Move the data to the CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # Compare the output logits with the labels to compute the accuracy
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")


Training...

  Average training loss: 0.61
  Training epoch took: 0:02:15

Running Validation...
  Accuracy: 0.74
  Validation took: 0:00:05

Training...

  Average training loss: 0.47
  Training epoch took: 0:02:19

Running Validation...
  Accuracy: 0.85
  Validation took: 0:00:05

Training...

  Average training loss: 0.25
  Training epoch took: 0:02:19

Running Validation...
  Accuracy: 0.89
  Validation took: 0:00:05

Training...

  Average training loss: 0.13
  Training epoch took: 0:02:20

Running Validation...
  Accuracy: 0.90
  Validation took: 0:00:05

Training complete!
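The validation loop reports the mean of the per-batch accuracies. With 200 validation reviews and a batch size of 4 every batch is full, so this equals the overall accuracy; with a smaller last batch the two would differ. A minimal sketch of the difference, using made-up predictions and hypothetical helper names:

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracies (what the validation loop does)."""
    accs = [sum(p == y for p, y in zip(preds, ys)) / len(ys) for preds, ys in batches]
    return sum(accs) / len(accs)

def overall_accuracy(batches):
    """Count correct predictions over all samples, independent of batch sizes."""
    correct = sum(sum(p == y for p, y in zip(preds, ys)) for preds, ys in batches)
    total = sum(len(ys) for _, ys in batches)
    return correct / total

# Hypothetical predictions: one full batch of 4 and a final batch of 1
batches = [([1, 0, 1, 1], [1, 0, 1, 1]),   # 4 of 4 correct
           ([0], [1])]                     # 0 of 1 correct
print(mean_of_batch_accuracies(batches))   # 0.5
print(overall_accuracy(batches))           # 0.8
```

Counting correct predictions and dividing once at the end is the safer choice when the dataset size is not a multiple of the batch size.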


In [19]:
torch.save(model, 'bert_model')

In [57]:
# Record the start time
t0 = time.time()

# Switch to evaluation mode
model.eval()

# Initialize the variables
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0

# Iterate over the batches from the data loader
for step, batch in enumerate(test_dataloader):
    # Show progress
    if step % 100 == 0 and not step == 0:
        elapsed = format_time(time.time() - t0)
        print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(test_dataloader), elapsed))

    # Move the batch to the device (the GPU if available)
    batch = tuple(t.to(device) for t in batch)
    
    # Unpack the batch
    b_input_ids, b_input_mask, b_labels = batch
    
    # No gradient computation
    with torch.no_grad():     
        # Forward pass
        outputs = model(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask)
    
    # Get the logits (no labels were passed, so the first output is the logits)
    logits = outputs[0]
    # Debug output: print the type of the inputs and the whole batch
    print(type(b_input_ids), batch)
    # Move the data to the CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    
    # Compare the output logits with the labels to compute the accuracy
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

print("")
print("Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
print("Test took: {:}".format(format_time(time.time() - t0)))

<class 'torch.Tensor'> (tensor([[  101, 10117, 17975,  ..., 16013, 10108, 35655],
        [  101, 47168, 14234,  ...,     0,     0,     0],
        [  101, 11065, 21852,  ..., 10485,   117, 10473],
        [  101, 38094, 10111,  ...,     0,     0,     0]], device='cuda:0',
       dtype=torch.int32), tensor([[1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 0., 0., 0.]], device='cuda:0'), tensor([1, 0, 0, 0], device='cuda:0'))
<class 'torch.Tensor'> (tensor([[  101, 29922, 10230,  ..., 20337, 31748, 10105],
        [  101, 10167, 15127,  ...,     0,     0,     0],
        [  101, 99946, 40430,  ...,     0,     0,     0],
        [  101, 33799, 33666,  ...,     0,     0,     0]], device='cuda:0',
       dtype=torch.int32), tensor([[1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.]], device='cuda:0'), t

<class 'torch.Tensor'> (tensor([[  101, 12489, 10124,  ...,     0,     0,     0],
        [  101, 10747, 10134,  ...,     0,     0,     0],
        [  101, 10747, 10124,  ...,     0,     0,     0],
        [  101, 65272, 47116,  ...,     0,     0,     0]], device='cuda:0',
       dtype=torch.int32), tensor([[1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.]], device='cuda:0'), tensor([0, 0, 1, 0], device='cuda:0'))
<class 'torch.Tensor'> (tensor([[  101,   146, 92147,  ...,     0,     0,     0],
        [  101,   146, 10134,  ...,     0,     0,     0],
        [  101,   146, 34420,  ...,     0,     0,     0],
        [  101, 15480, 10631,  ...,     0,     0,     0]], device='cuda:0',
       dtype=torch.int32), tensor([[1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.]], device='cuda:0'), t

<class 'torch.Tensor'> (tensor([[   101,    107, 109782,  ...,      0,      0,      0],
        [   101,  69699,  47234,  ...,      0,      0,      0],
        [   101,  74138,  13028,  ...,      0,      0,      0],
        [   101,  10747,  18379,  ...,      0,      0,      0]],
       device='cuda:0', dtype=torch.int32), tensor([[1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.]], device='cuda:0'), tensor([1, 1, 0, 0], device='cuda:0'))
<class 'torch.Tensor'> (tensor([[  101,   146, 10392,  ...,     0,     0,     0],
        [  101, 19318,   146,  ...,     0,     0,     0],
        [  101, 10747, 10458,  ...,     0,     0,     0],
        [  101, 10747, 10124,  ...,     0,     0,     0]], device='cuda:0',
       dtype=torch.int32), tensor([[1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0

In [73]:
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\haeji\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\haeji\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\haeji\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [83]:
## 1. Tweet preprocessing

file_dir = './data'
df = pd.DataFrame()

# Load the tweet files from the directory
for file_name in os.listdir(file_dir):
    # Skip files that are not tweet files
    #if ('PLTR' not in file_name) and ('palantir' not in file_name):
    if 'stock' in file_name:
        print(f"File skipped (not tweet): {file_name}")
        continue

    file_path = os.path.join(file_dir, file_name)
    
    # Load the tweet file and merge it into df
    try:
        file = pd.read_csv(file_path, index_col=0)
        if len(file) == 0:
            continue
        # Append this file's rows to the running frame
        df = pd.concat([df, file])

    except Exception as error_message:
        print(f'error: {error_message}')
        
# Reset the index
df = df.reset_index(drop=True)
df = df.rename(columns={'text':'tweet'})
print(df)

error: [Errno 13] Permission denied: './data\\.ipynb_checkpoints'
error: [Errno 13] Permission denied: './data\\aclImdb'
       search_word                     user  \
0         GameStop              Marcus Penn   
1         GameStop  h0bbes. Yeah, that one.   
2         GameStop                Omer Zach   
3         GameStop             Bill English   
4         GameStop                     Mae.   
...            ...                      ...   
102531       Tesla            Martin Burris   
102532       Tesla           UNICORN FOR PM   
102533       Tesla   Blocked by Sawyer Club   
102534       Tesla                  News 10   
102535       Tesla             Dan Burkland   

                                                    tweet  \
0       Yay geeking out! (@ GameStop) http://4sq.com/a...   
1       I'm at GameStop (8336 Agora Pkwy, The Forum, S...   
2       I just ousted \n@braisinhussy\n as the mayor o...   
3       I'm at Gamestop in Trenton, NJ http://gowal.la...   
4       I

In [84]:
def clean_tweet(tweet):
    # lemma and stop_words are module-level names defined in the next cell
    tweet = str(tweet)
    tweet = tweet.lower() # lower-case
    tweet = re.sub(r'https?://[a-zA-Z0-9@:%._/+~#=?&;-]*', ' ', tweet) # remove URLs
    tweet = re.sub(r'\$[a-zA-Z0-9]*', ' ', tweet) # remove ticker symbols (stock symbols starting with $)
    tweet = re.sub(r'@[a-zA-Z0-9]*', ' ', tweet) # remove user mentions (starting with @)
    tweet = re.sub(r"[^a-zA-Z']", ' ', tweet) # remove everything that is not a letter or an apostrophe
    tweet = ' '.join( [w for w in tweet.split() if len(w)>1] ) # drop one-character words
    
    # Lemmatize as nouns first, then as verbs, skipping stop words; the result is a list of tokens
    tweet = ' '.join([lemma.lemmatize(x) for x in nltk.wordpunct_tokenize(tweet) if x not in stop_words])
    tweet = [lemma.lemmatize(x, nltk.corpus.reader.wordnet.VERB) for x in nltk.wordpunct_tokenize(tweet) if x not in stop_words]
    return tweet 
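The regular-expression part of the cleaning can be tried on its own, without NLTK. A minimal sketch: `strip_tweet_noise` is a hypothetical helper that repeats only the regex steps (no stop-word removal or lemmatization), and the sample tweet is made up.

```python
import re

def strip_tweet_noise(tweet):
    """Regex-only part of the cleaning: lower-case, then drop URLs, tickers, mentions."""
    tweet = str(tweet).lower()
    tweet = re.sub(r"https?://[a-zA-Z0-9@:%._/+~#=?&;-]*", " ", tweet)  # URLs
    tweet = re.sub(r"\$[a-zA-Z0-9]*", " ", tweet)                        # $TSLA-style tickers
    tweet = re.sub(r"@[a-zA-Z0-9]*", " ", tweet)                         # @mentions
    tweet = re.sub(r"[^a-zA-Z']", " ", tweet)                            # anything but letters and '
    return " ".join(w for w in tweet.split() if len(w) > 1)              # drop 1-letter words

sample = "I'm at @GameStop buying $GME!! http://4sq.com/abc 100%"
print(strip_tweet_noise(sample))  # i'm at buying
```

Note that the mention pattern removes the whole handle, so a brand name written as an @mention disappears from the cleaned text.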

In [85]:
lemma = WordNetLemmatizer()
stop_words = stopwords.words("english")

In [86]:
# Add two new columns: the tokenized tweet, and the tokens joined back into a string
df["clean_tweet"] = df["tweet"].apply(lambda x:clean_tweet(x))
df["cleaned_tweet"] = df["clean_tweet"].apply(lambda x:' '.join(x))

In [87]:
df = df[df['search_word'] == 'Tesla'] # select the stock by name

In [88]:
tweet_data = df['tweet']

In [89]:
tweet_data

8004                       Dads new car tesla model x\n1\n2
8005      What an amazing machine! resolution to GoGreen...
8006      Tesla appears to be suffering from to much #Ne...
8007      Shahed: what if tesla had a New Year sale and ...
8008      Tesla Model S Burns To A Crisp During Supercha...
                                ...                        
102531    To anyone who can read this tweet and is a gun...
102532    Tesla is on . Literally. Again!\nJulia Bagg\n@...
102533    NEWS: Tesla has filed a rule change for grid o...
102534    "Luckily, it didn't happen while we were drivi...
102535    For those paying attention, this also confirms...
Name: tweet, Length: 94532, dtype: object

In [90]:
def pred_bert(tweet_data):
    document_bert = ["[CLS] " + str(s) + " [SEP]" for s in tweet_data]
    tokenized_texts = [tokenizer.tokenize(s) for s in document_bert]
    MAX_LEN = 512
    input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
    input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype='long', truncating='post', padding='post')
    attention_masks = []
    for seq in input_ids:
        seq_mask = [float(i>0) for i in seq]
        attention_masks.append(seq_mask)
    train_inputs = torch.tensor(input_ids)
    train_labels = torch.zeros(len(input_ids)) # the tweets have no labels, so fill in one dummy 0 per tweet
    train_masks = torch.tensor(attention_masks)
    train_data = TensorDataset(train_inputs, train_masks, train_labels)
    # Sequential order, so that pred[i] is the prediction for the i-th tweet
    train_sampler = SequentialSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)
    
    model.eval()
    # Initialize the list of predictions
    pred = []
    # Iterate over the batches from the data loader
    for step, batch in enumerate(train_dataloader):
        # Show progress (t0 is the global set in an earlier cell, so Elapsed counts from that point)
        if step % 100 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Move the batch to the device (the GPU if available)
        batch = tuple(t.to(device) for t in batch)

        # Unpack the batch
        b_input_ids, b_input_mask, b_labels = batch

        # No gradient computation
        with torch.no_grad():     
            # Forward pass
            outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask)

        # Get the logits (no labels were passed, so the first output is the logits)
        logits = outputs[0]

        # Move the data to the CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Store the predicted classes
        pred += np.argmax(logits, axis=1).flatten().tolist()
    return pred
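Predictions can be matched back to individual tweets only if the batches are drawn in input order. The sketch below shows that idea with plain lists; `iter_batches` is a hypothetical helper, and the "model" is replaced by a stand-in expression. It also checks the batch count for this dataset.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices so that outputs stay in input order."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

tweets = ["tweet %d" % i for i in range(10)]       # placeholder texts
preds = []
for batch in iter_batches(tweets, 4):
    preds += [len(t) % 2 for t in batch]           # stand-in for the model's argmax
assert len(preds) == len(tweets)                   # preds[i] belongs to tweets[i]

print(math.ceil(94532 / 4))  # 23633 batches for the 94,532 Tesla tweets
```

With the order preserved, the returned list can be stored directly as a new column next to the tweets.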

In [91]:
print(pred_bert(tweet_data))

  Batch   100  of  23,633.    Elapsed: 0:20:59.
  Batch   200  of  23,633.    Elapsed: 0:21:07.
  Batch   300  of  23,633.    Elapsed: 0:21:16.
  Batch   400  of  23,633.    Elapsed: 0:21:25.
  Batch   500  of  23,633.    Elapsed: 0:21:33.
  Batch   600  of  23,633.    Elapsed: 0:21:42.
  Batch   700  of  23,633.    Elapsed: 0:21:51.
  Batch   800  of  23,633.    Elapsed: 0:21:59.
  Batch   900  of  23,633.    Elapsed: 0:22:09.
  Batch 1,000  of  23,633.    Elapsed: 0:22:18.
  Batch 1,100  of  23,633.    Elapsed: 0:22:27.
  Batch 1,200  of  23,633.    Elapsed: 0:22:36.
  Batch 1,300  of  23,633.    Elapsed: 0:22:46.
  Batch 1,400  of  23,633.    Elapsed: 0:22:55.
  Batch 1,500  of  23,633.    Elapsed: 0:23:04.
  Batch 1,600  of  23,633.    Elapsed: 0:23:14.
  Batch 1,700  of  23,633.    Elapsed: 0:23:23.
  Batch 1,800  of  23,633.    Elapsed: 0:23:32.
  Batch 1,900  of  23,633.    Elapsed: 0:23:42.
  Batch 2,000  of  23,633.    Elapsed: 0:23:51.
  Batch 2,100  of  23,633.    Elapsed: 0

  Batch 17,100  of  23,633.    Elapsed: 0:47:13.
  Batch 17,200  of  23,633.    Elapsed: 0:47:22.
  Batch 17,300  of  23,633.    Elapsed: 0:47:31.
  Batch 17,400  of  23,633.    Elapsed: 0:47:41.
  Batch 17,500  of  23,633.    Elapsed: 0:47:50.
  Batch 17,600  of  23,633.    Elapsed: 0:47:59.
  Batch 17,700  of  23,633.    Elapsed: 0:48:09.
  Batch 17,800  of  23,633.    Elapsed: 0:48:18.
  Batch 17,900  of  23,633.    Elapsed: 0:48:27.
  Batch 18,000  of  23,633.    Elapsed: 0:48:37.
  Batch 18,100  of  23,633.    Elapsed: 0:48:46.
  Batch 18,200  of  23,633.    Elapsed: 0:48:55.
  Batch 18,300  of  23,633.    Elapsed: 0:49:05.
  Batch 18,400  of  23,633.    Elapsed: 0:49:14.
  Batch 18,500  of  23,633.    Elapsed: 0:49:23.
  Batch 18,600  of  23,633.    Elapsed: 0:49:32.
  Batch 18,700  of  23,633.    Elapsed: 0:49:42.
  Batch 18,800  of  23,633.    Elapsed: 0:49:51.
  Batch 18,900  of  23,633.    Elapsed: 0:50:00.
  Batch 19,000  of  23,633.    Elapsed: 0:50:10.
  Batch 19,100  of  