<a href="https://colab.research.google.com/github/mongbro/colab/blob/main/09_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Classifying BBC Articles with a Keras RNN

1. Importing packages and setting parameters

In [None]:
# Import packages
import numpy as np
import csv
import nltk # natural language tool kit

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, Dropout, Embedding
from keras.layers import Bidirectional
from time import time
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

In [None]:
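# Set parameters
MY_VOCAB = 5000   # vocabulary size: keep only the most frequent words
MY_EMBED = 64     # embedding dimension
MY_HIDDEN = 100   # number of LSTM units
MY_LEN = 200      # article length in tokens
# Original values => 5000, 64, 100, 200

MY_SPLIT = 0.8    # fraction of the data used for training
MY_SAMPLE = 123   # index of the sample article
MY_EPOCH = 100    # number of training epochs
TRAIN_MODE = 1    # switch between training mode and evaluation mode
# Original values => 0.8, 123, 100, 1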

2. Data processing

In [None]:
# Set up the stopwords
nltk.download('stopwords')
MY_STOP = set(nltk.corpus.stopwords.words('english'))

print('English stopwords')
print(MY_STOP)
print('Number of stopwords :', len(MY_STOP))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
English stopwords
{'won', 'has', 'was', 'will', 'to', 'weren', 'in', 'all', "you'll", 'our', "couldn't", 'you', 'before', 'having', 'being', 'about', 'ma', 'myself', 'some', "should've", 'its', 'were', 'been', 'over', 'didn', 'under', 'shouldn', 'against', "you'd", 'am', 'what', 'hadn', "you've", 'does', 'no', 'too', 'now', 'same', 'from', 'while', 'again', 'during', 'both', "weren't", 'other', 'he', 'the', "didn't", 'most', 'those', 'who', 'wasn', 'why', 'if', 'ourselves', 'll', 'because', "aren't", 'his', 'couldn', 'it', 'hers', 'which', 'but', 'here', 'y', "that'll", "you're", 'for', 'any', 'than', "isn't", 'above', 'her', 'more', 'shan', "won't", 'into', "haven't", 'down', 'him', 've', "doesn't", 't', 'just', 'yours', 'through', 'an', "hasn't", 're', 'how', 'haven', 'me', 'm', 'doing', "hadn't", "mightn't", 'wouldn', 'i', 'of', 'my', 'a', 'at', 'out', 'after', 'that', 'with

In [None]:
# Containers for the data
original = []
articles = []
labels = []

In [None]:
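# Read and process the BBC file
with open('/content/drive/MyDrive/Colab Notebooks/data/bbc-text.csv', 'r') as file:
    # Skip the header row with the column names
    reader = csv.reader(file)
    next(reader)

    # Process the articles one at a time
    for row in reader:
        # Store the category
        labels.append(row[0])

        # Store the original article
        original.append(row[1])

        # Remove the stopwords
        # Note: this only matches stopwords with a space on both sides,
        # so ones next to punctuation (e.g. 'out.') are kept
        news = row[1]
        for word in MY_STOP:
            mask = ' ' + word + ' '
            news = news.replace(mask, ' ')
        # Store the article with the stopwords removed
        articles.append(news)

print('Articles processed :', len(articles))

Articles processed : 2225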
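The replace-based loop above removes a stopword only when it has a space on both sides. For comparison, here is a minimal sketch of token-based filtering. The function names and the toy stopword set are illustrative assumptions, not part of the notebook, and switching to this approach would change the token statistics reported in the rest of the notebook.

```python
def strip_stopwords_replace(text, stopwords):
    """Mimic the notebook's approach: replace ' word ' with a single space."""
    for word in sorted(stopwords):
        text = text.replace(' ' + word + ' ', ' ')
    return text


def strip_stopwords_tokens(text, stopwords):
    """Alternative sketch: split on whitespace and drop matching tokens."""
    return ' '.join(w for w in text.split() if w not in stopwords)


# Toy data (hypothetical), chosen to hit the edge cases:
# a stopword at the start, at the end, and repeated back to back.
toy_stop = {'the', 'a', 'of', 'out'}
toy_text = 'the price of a a share ran out'

by_replace = strip_stopwords_replace(toy_text, toy_stop)
by_tokens = strip_stopwords_tokens(toy_text, toy_stop)

print(by_replace)  # the price a share ran out
print(by_tokens)   # price share ran
```

The token-based version also collapses repeated spaces, so it is not a drop-in replacement if the exact text layout matters.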


In [None]:
# Print the sample article
print('Original sample article >> ')
print(original[MY_SAMPLE])
print(labels[MY_SAMPLE])
print('Total words :', len(original[MY_SAMPLE].split()))

Original sample article >> 
screensaver tackles spam websites net users are getting the chance to fight back against spam websites  internet portal lycos has made a screensaver that endlessly requests data from sites that sell the goods and services mentioned in spam e-mail. lycos hopes it will make the monthly bandwidth bills of spammers soar by keeping their servers running flat out. the net firm estimates that if enough people sign up and download the tool  spammers could end up paying to send out terabytes of data.   we ve never really solved the big problem of spam which is that its so damn cheap and easy to do   said malte pollmann  spokesman for lycos europe.  in the past we have built up the spam filtering systems for our users   he said   but now we are going to go one step further.    we ve found a way to make it much higher cost for spammers by putting a load on their servers.  by getting thousands of people to download and use the screensaver  lycos hopes to get spamming websites constant

In [None]:
# Result of the stopword removal
print('Sample article with stopwords removed >> ')
print(articles[MY_SAMPLE])
print('Total words :', len(articles[MY_SAMPLE].split()))

Sample article with stopwords removed >> 
screensaver tackles spam websites net users getting chance fight back spam websites  internet portal lycos made screensaver endlessly requests data sites sell goods services mentioned spam e-mail. lycos hopes make monthly bandwidth bills spammers soar keeping servers running flat out. net firm estimates enough people sign download tool  spammers could end paying send terabytes data.   never really solved big problem spam damn cheap easy   said malte pollmann  spokesman lycos europe.  past built spam filtering systems users   said   going go one step further.    found way make much higher cost spammers putting load servers.  getting thousands people download use screensaver  lycos hopes get spamming websites constantly running almost full capacity. mr pollmann said intention stop spam websites working subjecting much data cope with. said screensaver carefully written ensure amount traffic generated user overload web.  every single user contribute three four megabytes p

In [None]:
# Tokenize the articles
A_token = Tokenizer(num_words = MY_VOCAB,
                    oov_token = 'oov')
# What is OOV (out of vocabulary)? Words that survive stopword removal but are
# too infrequent to make it into the MY_VOCAB (5000) most frequent words.
# The smaller MY_VOCAB is, the more words become OOV.

A_token.fit_on_texts(articles)
A_tokenized = A_token.texts_to_sequences(articles)  # => convert text to integers (word-index lookup, not a hash)

# Conversion examples
print(A_token.sequences_to_texts([[1]]))      # which word is 1? => 'oov' (out-of-vocabulary word)
                                              # the smaller MY_VOCAB is, the more 1s appear
print(A_token.sequences_to_texts([[1259]]))   # which word is 1259? => 'welsh' in this run
print(A_token.texts_to_sequences(['the']))    # which integer is 'the'? => 1219 in this run
print(A_token.texts_to_sequences(['oov']))    # which integer is 'oov'? => 1

['oov']
['welsh']
[[1219]]
[[1]]
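To make the role of `num_words` and `oov_token` more concrete, here is a simplified pure-Python sketch of a frequency-ranked word index. It is not the Keras implementation: it only lowercases and splits on whitespace, and the helper names `build_word_index` and `to_sequence` are assumptions made for this illustration.

```python
from collections import Counter


def build_word_index(texts, oov_token='oov'):
    """Give the OOV token id 1, then rank words by descending frequency."""
    counts = Counter(w for t in texts for w in t.lower().split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index


def to_sequence(text, index, num_words, oov_token='oov'):
    """Map words to ids; unknown words and ids >= num_words become OOV."""
    oov_id = index[oov_token]
    sequence = []
    for w in text.lower().split():
        i = index.get(w)
        sequence.append(i if i is not None and i < num_words else oov_id)
    return sequence


# Toy corpus (hypothetical): 'spam' is most frequent, 'server' least.
toy_texts = ['spam spam spam filter', 'spam filter server']
toy_index = build_word_index(toy_texts)
print(toy_index)  # {'oov': 1, 'spam': 2, 'filter': 3, 'server': 4}

# With num_words=4 only ids 1-3 are kept, so 'server' falls back to OOV.
toy_sequence = to_sequence('spam server unknown', toy_index, num_words=4)
print(toy_sequence)  # [2, 1, 1]
```

Lowering `num_words` in this sketch pushes more words to id 1, which is the same effect described for `MY_VOCAB` above.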


In [None]:
# Print the tokenization result
sample = A_tokenized[MY_SAMPLE]
print(sample)

[3171, 1, 816, 878, 115, 136, 382, 347, 716, 28, 816, 878, 228, 1, 3172, 27, 3171, 1, 4868, 203, 568, 733, 1771, 126, 4025, 816, 260, 395, 3172, 700, 21, 1647, 3629, 2849, 2607, 1, 2326, 2551, 453, 2919, 569, 115, 63, 2291, 381, 7, 1160, 780, 1860, 2607, 11, 92, 1571, 1051, 1, 203, 281, 154, 1, 138, 364, 816, 1, 2225, 847, 2, 1, 1, 178, 3172, 139, 255, 1109, 816, 1, 726, 136, 2, 52, 60, 10, 818, 3792, 195, 41, 21, 56, 494, 245, 2607, 1363, 1, 2551, 382, 1021, 7, 780, 70, 3171, 3172, 700, 23, 1, 878, 3993, 453, 343, 322, 1394, 3, 1, 2, 3428, 582, 816, 878, 297, 1, 56, 203, 2297, 2404, 2, 3171, 2709, 1069, 660, 812, 1287, 3885, 1539, 1, 466, 224, 503, 1539, 1, 31, 96, 1, 681, 111, 2, 10, 1899, 912, 2, 381, 7, 1160, 1, 878, 11, 722, 256, 1, 1287, 224, 503, 111, 3172, 79, 70, 260, 395, 716, 28, 2, 3, 1, 4, 1604, 10, 823, 455, 158, 823, 455, 2, 568, 2179, 4025, 816, 260, 395, 891, 733, 1771, 126, 220, 3677, 568, 316, 86, 1051, 816, 260, 395, 3677, 23, 1, 1453, 681, 111, 415, 568, 3171, 760,

In [None]:
# Article statistics
# Find the longest and shortest articles after stopword removal
longest = max([len(x) for x in A_tokenized])
shortest = min([len(x) for x in A_tokenized])

print('Longest article :', longest)
print('Shortest article :', shortest)

# Number of distinct words across all articles after stopword removal
print('Unique words :', len(A_token.word_counts))

Longest article : 2279
Shortest article : 50
Unique words : 29698


In [None]:
# Make all articles the same length
# Truncate those longer than MY_LEN and pad shorter ones with 0
A_tokenized = pad_sequences(A_tokenized,
                            maxlen = MY_LEN,
                            padding = 'post',     # articles shorter than MY_LEN tokens are padded with 0 at the end
                            truncating = 'post')  # articles longer than MY_LEN tokens lose their tail

# Check the article lengths
longest = max([len(x) for x in A_tokenized])
shortest = min([len(x) for x in A_tokenized])

print('Longest article :', longest)
print('Shortest article :', shortest)

Longest article : 200
Shortest article : 200
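The effect of `padding = 'post'` and `truncating = 'post'` can be sketched in plain Python. The helpers `pad_post` and `pad_pre` below are hypothetical names for this illustration; `pad_pre` mirrors what `pad_sequences` does by default when neither argument is set.

```python
def pad_post(sequence, maxlen, value=0):
    """Keep the first maxlen items and append zeros if the sequence is short."""
    return (list(sequence) + [value] * maxlen)[:maxlen]


def pad_pre(sequence, maxlen, value=0):
    """Keep the last maxlen items and prepend zeros if the sequence is short."""
    tail = list(sequence)[-maxlen:]
    return [value] * (maxlen - len(tail)) + tail


short_article = [5, 6, 7]
long_article = [1, 2, 3, 4, 5, 6]

print(pad_post(short_article, 5))  # [5, 6, 7, 0, 0]
print(pad_post(long_article, 5))   # [1, 2, 3, 4, 5]
print(pad_pre(short_article, 5))   # [0, 0, 5, 6, 7]
print(pad_pre(long_article, 5))    # [2, 3, 4, 5, 6]
```

With the 'post' settings, the model always sees the opening of an article, which tends to suit news text where the lead carries the topic.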


In [None]:
# Tokenize the labels
C_token = Tokenizer()
C_token.fit_on_texts(labels)      # build the label-to-integer index (ids start at 1)
C_tokenized = C_token.texts_to_sequences(labels)

# Conversion examples
print(C_token.word_index)
print(C_tokenized)

{'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}
[[4], [2], [1], [1], [5], [3], [3], [1], [1], [5], [5], [2], [2], [3], [1], [2], [3], [1], [2], [4], [4], [4], [1], [1], [4], [1], [5], [4], [3], [5], [3], [4], [5], [5], [2], [3], [4], [5], [3], [2], [3], [1], [2], [1], [4], [5], [3], [3], [3], [2], [1], [3], [2], [2], [1], [3], [2], [1], [1], [2], [2], [1], [2], [1], [2], [4], [2], [5], [4], [2], [3], [2], [3], [1], [2], [4], [2], [1], [1], [2], [2], [1], [3], [2], [5], [3], [3], [2], [5], [2], [1], [1], [3], [1], [3], [1], [2], [1], [2], [5], [5], [1], [2], [3], [3], [4], [1], [5], [1], [4], [2], [5], [1], [5], [1], [5], [5], [3], [1], [1], [5], [3], [2], [4], [2], [2], [4], [1], [3], [1], [4], [5], [1], [2], [2], [4], [5], [4], [1], [2], [2], [2], [4], [1], [4], [2], [1], [5], [1], [4], [1], [4], [3], [2], [4], [5], [1], [2], [3], [2], [5], [3], [3], [5], [3], [2], [5], [3], [3], [5], [3], [1], [2], [3], [3], [2], [5], [1], [2], [2], [1], [4], [1], [4], [4], 
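Because the label ids start at 1, the notebook later gives the output layer 6 units so that id 5 stays in range. One alternative, not used in this notebook, is to shift the ids down by one so that five output units are enough. The sketch below uses the `word_index` printed above and a short hypothetical label list; predictions would then need 1 added back before comparing with the original ids.

```python
# Mapping copied from the printed C_token.word_index.
word_index = {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}

# Hypothetical short label list for illustration.
toy_labels = ['tech', 'business', 'sport', 'sport', 'entertainment']

# Shape matches texts_to_sequences: one single-item list per label.
one_based = [[word_index[name]] for name in toy_labels]

# Shift every id down by one so the classes run from 0 to 4.
zero_based = [[row[0] - 1] for row in one_based]

# Number of output units needed in each scheme.
units_one_based = max(row[0] for row in one_based) + 1
units_zero_based = max(row[0] for row in zero_based) + 1

print(one_based)    # [[4], [2], [1], [1], [5]]
print(zero_based)   # [[3], [1], [0], [0], [4]]
print(units_one_based, units_zero_based)  # 6 5
```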

In [None]:
# Split the data into four sets
C_tokenized = np.array(C_tokenized)   # C_tokenized was a list until now
X_train, X_test, Y_train, Y_test = train_test_split(A_tokenized,
                                                    C_tokenized,
                                                    train_size = MY_SPLIT,
                                                    shuffle = False)

# Check the data shapes
print('Training input shape :', X_train.shape)
print('Training output shape :', Y_train.shape)

print('Test input shape :', X_test.shape)
print('Test output shape :', Y_test.shape)

Training input shape : (1780, 200)
Training output shape : (1780, 1)
Test input shape : (445, 200)
Test output shape : (445, 1)
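With `shuffle = False` the split is purely positional: the first 80% of the articles go to training and the rest to testing. A minimal sketch of that behavior follows. `ordered_split` is a hypothetical helper, and its rounding is only checked here for the 2225-article case, so it may not match scikit-learn for every size.

```python
from collections import Counter


def ordered_split(items, train_fraction):
    """Split a list by position, without shuffling."""
    n_train = int(len(items) * train_fraction)
    return items[:n_train], items[n_train:]


# Same sizes as the notebook: 2225 items with an 0.8 training fraction.
train_part, test_part = ordered_split(list(range(2225)), 0.8)
print(len(train_part), len(test_part))  # 1780 445

# Hypothetical toy labels sorted by class, to show why order matters:
# an ordered split of sorted data leaves the test set with one class only.
toy_labels = [1] * 4 + [2] * 4 + [3] * 2
toy_train, toy_test = ordered_split(toy_labels, 0.8)
print(Counter(toy_train))  # Counter({1: 4, 2: 4})
print(Counter(toy_test))   # Counter({3: 2})
```

The label list printed earlier looks well mixed, so the ordered split is probably harmless here, but counting the classes in each part is a cheap check.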


3. Building the neural network

In [None]:
# Build the RNN
model = Sequential()

model.add(Embedding(input_dim = MY_VOCAB,       # conceptually, a 1 x 5000 one-hot vector times a 5000 x 64 matrix
                    output_dim = MY_EMBED))     # gives a 1 x 64 vector for each token

model.add(Dropout(rate = 0.5))  # randomly sets some of the outputs to 0 during training
                                # why? => to reduce overfitting

model.add(Bidirectional(LSTM(units = MY_HIDDEN)))

model.add(Dense(units = 6,                  # why 6 and not 5?
                activation = 'softmax'))    # the labels are 1-5, but the output units are indexed from 0,
                                            # so 5 units would only cover classes 0-4
                                            # and class 5 could never be predicted
print('RNN summary')
model.summary()

RNN summary
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          320000    
_________________________________________________________________
dropout (Dropout)            (None, None, 64)          0         
_________________________________________________________________
bidirectional (Bidirectional (None, 200)               132000    
_________________________________________________________________
dense (Dense)                (None, 6)                 1206      
Total params: 453,206
Trainable params: 453,206
Non-trainable params: 0
_________________________________________________________________
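The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the variable names are chosen for this illustration.

```python
MY_VOCAB = 5000    # vocabulary size
MY_EMBED = 64      # embedding dimension
MY_HIDDEN = 100    # LSTM units per direction
NUM_OUTPUTS = 6    # units in the final Dense layer

# Embedding: one 64-dimensional vector per vocabulary entry.
embedding_params = MY_VOCAB * MY_EMBED

# LSTM: 4 gates, each with (input + recurrent + bias) weights per unit.
lstm_one_direction = 4 * (MY_EMBED + MY_HIDDEN + 1) * MY_HIDDEN

# Bidirectional: one LSTM reading forward, one reading backward.
bidirectional_params = 2 * lstm_one_direction

# Dense: the two directions are concatenated (200 inputs), plus a bias per unit.
dense_params = (2 * MY_HIDDEN + 1) * NUM_OUTPUTS

total_params = embedding_params + bidirectional_params + dense_params

print(embedding_params)      # 320000
print(bidirectional_params)  # 132000
print(dense_params)          # 1206
print(total_params)          # 453206
```

The concatenation of the two directions is also why the Bidirectional layer reports an output shape of (None, 200) rather than (None, 100).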


4. Training the neural network

In [None]:
# Train the RNN
model.compile(optimizer = 'adam',
              loss = 'sparse_categorical_crossentropy',
              metrics = ['acc'])

print('Training started')
begin = time()

model.fit(x = X_train,
          y = Y_train,
          epochs = MY_EPOCH,
          verbose = 1)

end = time()
print('Training time : {:.2f}s'.format(end - begin))

Training started
Epoch 1/100
Epoch 2/100
...
Epoch 77/100
...

5. Evaluating the neural network

In [None]:
# Evaluate the RNN
score = model.evaluate(X_test, Y_test,
                       verbose = 0)

print('Final loss : {:.2f}'.format(score[0]))
print('Final accuracy : {:.2f}'.format(score[1]))

Final loss : 0.20
Final accuracy : 0.96


6. Predicting with the neural network

In [None]:
# Predict with the RNN
pred = model.predict(X_test)
pred = pred.argmax(axis = 1)
print(pred)
print(Y_test.flatten())

[5 4 3 1 1 4 2 4 3 5 3 3 2 5 1 5 5 2 1 3 4 2 1 5 4 3 3 1 1 3 2 2 2 2 5 2 3
 3 4 4 5 3 5 2 3 1 1 3 4 2 4 1 2 2 3 1 1 3 3 5 5 3 2 3 3 2 4 3 3 3 3 3 5 5
 4 3 1 3 1 4 1 1 1 5 4 5 4 1 5 1 1 5 5 2 5 5 3 2 1 4 4 3 2 1 2 5 1 3 5 1 1
 2 3 4 4 2 2 1 3 5 1 1 3 5 4 4 5 2 3 1 3 4 5 1 3 2 5 3 5 3 1 3 2 2 3 2 4 1
 2 5 2 1 1 5 4 3 4 3 3 1 1 1 2 4 5 2 1 2 1 2 4 2 2 2 2 1 1 1 2 2 5 2 2 2 1
 1 1 4 4 1 1 1 2 5 4 4 4 3 2 2 4 2 4 1 1 3 3 3 1 1 3 3 4 2 1 1 1 1 2 1 2 2
 2 2 1 3 1 3 4 1 4 2 5 2 1 2 4 4 3 5 2 5 2 4 3 5 2 5 5 4 3 4 4 2 3 1 5 2 3
 5 2 4 1 4 3 1 3 2 3 3 2 2 2 4 3 2 3 2 4 3 1 3 3 1 5 4 4 2 4 1 2 2 2 1 4 4
 4 1 5 1 3 2 3 3 5 4 2 4 1 5 5 1 2 5 4 4 1 5 2 3 3 3 4 4 2 3 2 4 3 5 1 4 2
 4 5 4 4 1 3 1 1 3 5 5 2 3 3 1 2 2 4 2 4 4 1 2 3 1 2 2 1 4 1 4 5 1 1 5 2 4
 1 1 3 4 2 3 1 1 3 2 4 4 4 2 1 5 4 4 2 3 4 1 1 4 4 3 2 1 5 5 1 3 4 1 2 2 2
 1 1 4 1 2 4 2 2 1 2 3 2 2 4 3 4 3 4 5 3 4 5 4 3 5 2 4 2 4 5 4 1 2 2 3 5 3
 1]
[5 4 3 1 1 4 2 4 5 5 3 3 2 5 1 5 5 2 1 3 4 2 1 5 4 3 3 1 1 2 2 2 2 2 5 2 3
 3 4 4 5 3 5 2 3 1 1 

In [None]:
# Print the confusion matrix
# Rows are the true labels (1-5), columns are the predicted labels (1-5)
print('Confusion matrix')
print(confusion_matrix(y_true = Y_test,
                       y_pred = pred))

Confusion matrix
[[ 99   0   0   2   0]
 [  0 101   4   1   0]
 [  0   2  83   1   0]
 [  1   0   1  83   1]
 [  0   1   2   2  61]]
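The printed matrix is enough to recover accuracy and per-class scores by hand. The sketch below copies the matrix from the output above and pairs the rows with category names in the order of the label index (1 to 5); the variable names are assumptions for this illustration.

```python
# Confusion matrix copied from the output: rows = true, columns = predicted.
matrix = [
    [99, 0, 0, 2, 0],
    [0, 101, 4, 1, 0],
    [0, 2, 83, 1, 0],
    [1, 0, 1, 83, 1],
    [0, 1, 2, 2, 61],
]
names = ['sport', 'business', 'politics', 'tech', 'entertainment']

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

precision, recall, f1 = {}, {}, {}
for i, name in enumerate(names):
    true_positive = matrix[i][i]
    predicted_as_class = sum(row[i] for row in matrix)  # column sum
    actually_class = sum(matrix[i])                     # row sum
    precision[name] = true_positive / predicted_as_class
    recall[name] = true_positive / actually_class
    f1[name] = (2 * precision[name] * recall[name]
                / (precision[name] + recall[name]))

macro_f1 = sum(f1.values()) / len(f1)

print('accuracy : {:.2f}'.format(accuracy))  # accuracy : 0.96
for name in names:
    print('{:<14} precision {:.3f}  recall {:.3f}  f1 {:.3f}'.format(
        name, precision[name], recall[name], f1[name]))
print('macro F1 : {:.3f}'.format(macro_f1))
```

The accuracy computed this way, 427 correct out of 445, agrees with the 0.96 reported by `model.evaluate`.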


In [None]:
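# Classify a real article
# Note: unlike the training articles, this text is not passed through the stopword removal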
news = ['Paul Pogbas second-half volley was enough to give Manchester United victory at Burnley and send them three points clear at the top of the Premier League.United dominated a contest in which Burnley failed to register a single shot on target until stoppage time.But they were struggling to make a breakthrough until Marcus Rashford picked Pogba out with an excellent cross to the edge of the area.The Frenchmans connection was perfect, although it took a deflection off Matthew Lowton to ensure the ball went past Nick Pope and into the Burnley net.Although Burnley had three decent chances in a frantic ending, United secured the win to head the table after 17 rounds of matches.']

news = A_token.texts_to_sequences(news)
print(news)

news = pad_sequences(news,
                     maxlen = MY_LEN,
                     padding = 'post',
                     truncating = 'post')
#print(news)

pred = model.predict(news)
pred = pred.argmax(axis = 1)
print('RNN prediction :', pred)

# {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}


[[630, 1, 64, 104, 4061, 1, 381, 605, 208, 658, 190, 434, 3380, 4401, 1770, 1051, 597, 31, 549, 373, 3380, 1219, 66, 1332, 1219, 2569, 463, 190, 1446, 1364, 1750, 619, 1, 4401, 534, 605, 2992, 1364, 503, 811, 578, 760, 1, 1, 14, 1, 1, 1, 1849, 605, 21, 1364, 2534, 1, 1, 1, 1309, 1, 569, 2404, 1, 2561, 872, 605, 1219, 2177, 1332, 1219, 826, 1219, 1, 1581, 1, 2599, 238, 221, 170, 1364, 1, 813, 2821, 1, 605, 660, 1219, 692, 293, 255, 2853, 1, 1770, 1, 1219, 4401, 115, 238, 4401, 3728, 31, 3291, 1370, 619, 1364, 1, 3342, 190, 2627, 1219, 58, 605, 392, 1219, 2367, 1, 579, 1, 1332, 1251]]
RNN prediction : [1]
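The predicted id can be turned back into a category name by inverting the label index. The sketch below uses the mapping shown in the comment above; the probability row is a hypothetical softmax output with 6 entries, not a value produced by the model.

```python
# Mapping copied from the label index shown above.
word_index = {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}

# Invert it so an integer id looks up its category name.
id_to_name = {i: name for name, i in word_index.items()}


def argmax(values):
    """Return the position of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical softmax output for one article: position 0 is the unused class.
toy_probabilities = [0.01, 0.90, 0.03, 0.02, 0.02, 0.02]

predicted_id = argmax(toy_probabilities)
predicted_name = id_to_name.get(predicted_id, 'unknown')

print(predicted_id)    # 1
print(predicted_name)  # sport
```

Using `.get` with a fallback covers the case where the model picks position 0, which has no category because the label ids start at 1.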
