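<h2>Named Entity Recognition (NER)</h2>
<b>A technique for recognizing named entities in text</b><br>
<b>Example: in '철수와 영희는 밥을 먹었다' ("Cheolsu and Yeonghui ate a meal"), 철수 (Cheolsu) is a person name, 영희 (Yeonghui) is a person name, and 밥 (meal) is an object</b><br>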

In [19]:
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

In [2]:
import nltk

In [None]:
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')

<b>Tokenization and part-of-speech tagging</b>

In [4]:
from nltk import word_tokenize, pos_tag, ne_chunk

In [None]:
sentence = 'James is working at Disney in London'
sentence = pos_tag(word_tokenize(sentence))
print(sentence)

<b>Named entity recognition</b>

In [None]:
sentence = ne_chunk(sentence)
print(sentence)

In [11]:
import numpy as np
import urllib.request

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

In [None]:
tagged_sentences = []
sentence = []

with urllib.request.urlopen('https://raw.githubusercontent.com/Franck-Dernoncourt/NeuroNER/master/neuroner/data/conll2003/en/train.txt') as f:
    for line in f:
        line = line.decode('utf-8')
        # Blank lines and -DOCSTART- markers separate sentences.
        if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
            if len(sentence) > 0:
                tagged_sentences.append(sentence)
                sentence = []  # start collecting the next sentence
            continue
        splits = line.strip().split(' ')
        word = splits[0].lower()
        sentence.append([word, splits[-1]])
        
print(len(tagged_sentences))
print(tagged_sentences[0])

<b>Data preprocessing</b>

In [None]:
sentences, ner_tags = [], []

for tagged_sentence in tagged_sentences:
    # Split each list of [word, tag] pairs into a word tuple and a tag tuple.
    sentence, tag_info = zip(*tagged_sentence)
    sentences.append(list(sentence))
    ner_tags.append(list(tag_info))
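
The preprocessing step relies on `zip(*pairs)` to transpose a list of pairs. A minimal sketch on one hand-written sentence (the data here is hypothetical, in the same `[word, tag]` layout as `tagged_sentences`):

```python
# Hypothetical tagged sentence in the [word, tag] layout.
tagged_sentence = [['eu', 'B-ORG'], ['rejects', 'O'], ['german', 'B-MISC']]

# zip(*pairs) transposes the list of pairs into one tuple per column.
words, tags = zip(*tagged_sentence)

print(list(words))
print(list(tags))
```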

<b>Tokenization that cleans the text and keeps only the most frequent words</b>

In [None]:
max_words = 4000
src_tokenizer = Tokenizer(num_words = max_words, oov_token = 'OOV')
src_tokenizer.fit_on_texts(sentences)

tar_tokenizer = Tokenizer()
tar_tokenizer.fit_on_texts(ner_tags)
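
To make the effect of `num_words` and `oov_token` concrete, here is a minimal pure-Python sketch of the idea: words are ranked by frequency, index 1 is reserved for the OOV token, and any word that is unseen or ranked at `num_words` or beyond is replaced by that OOV index. The helpers `build_word_index` and `encode` are hypothetical names for illustration. This imitates the behavior as I understand it and is not the Keras implementation.

```python
from collections import Counter

def build_word_index(texts, oov_token='OOV'):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text)
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, num_words, oov_token='OOV'):
    """Map words to indices, sending rare or unseen words to the OOV index."""
    oov = word_index[oov_token]
    encoded = []
    for word in text:
        index = word_index.get(word)
        if index is None or index >= num_words:
            encoded.append(oov)
        else:
            encoded.append(index)
    return encoded

texts = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['the', 'cat', 'ran']]
word_index = build_word_index(texts)
print(word_index)
print(encode(['the', 'dog', 'flew'], word_index, num_words=4))
```

In the last call, 'dog' is in the vocabulary but too rare to survive the cap, and 'flew' was never seen, so both map to the OOV index.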

In [None]:
vocab_size = max_words
tag_size = len(tar_tokenizer.word_index) + 1

print(vocab_size)
print(tag_size)

<b>Convert the data to integer arrays for training with texts_to_sequences()</b>

In [None]:
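max_len = 70
# Map words and tags to integer indices, then pad every sequence to max_len.
x_train = src_tokenizer.texts_to_sequences(sentences)
y_train = tar_tokenizer.texts_to_sequences(ner_tags)
x_train = pad_sequences(x_train, padding = 'post', maxlen = max_len)
y_train = pad_sequences(y_train, padding = 'post', maxlen = max_len)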
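
What post-padding does to sequences of different lengths can be sketched in plain Python. The helper `pad_post` is a hypothetical stand-in, not the Keras function. It assumes a positive `maxlen` and follows the `pad_sequences` default of truncating from the front (`truncating='pre'`), which still applies when only `padding='post'` is set.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; drop leading items from long sequences."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]  # keep only the last `maxlen` items
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

print(pad_post([[5, 3], [7, 2, 9, 4, 1]], maxlen=4))
```

The short sequence gains trailing zeros, while the long one loses its first element, so sentences longer than `max_len` lose their opening words unless `truncating='post'` is requested.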

<b>Split into training and test data and apply one-hot encoding</b>

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size = .2, random_state = 111)

y_train = to_categorical(y_train, num_classes = tag_size)
y_test = to_categorical(y_test, num_classes = tag_size)
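
One-hot encoding adds a trailing class axis to the labels, so each padded tag sequence becomes a `max_len` by `tag_size` grid. A minimal pure-Python sketch of that transformation follows; `one_hot` is a hypothetical helper for illustration and not the Keras `to_categorical` function.

```python
def one_hot(padded_tags, num_classes):
    """Turn each integer tag into a vector with a single 1 at that index."""
    encoded = []
    for sequence in padded_tags:
        rows = []
        for tag in sequence:
            row = [0] * num_classes
            row[tag] = 1
            rows.append(row)
        encoded.append(rows)
    return encoded

y = one_hot([[1, 2, 0]], num_classes=3)
print(y)
```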

<b>Final dataset sizes:</b>

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

<b>Model construction and training</b>

In [None]:
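# Import from tensorflow.keras, matching the preprocessing imports above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional, TimeDistributed
from tensorflow.keras.optimizers import Adam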

In [None]:
model = Sequential()
model.add(Embedding(input_dim = vocab_size, output_dim = 128, input_length = max_len, mask_zero = True))
model.add(Bidirectional(LSTM(256, return_sequences = True)))
model.add(TimeDistributed(Dense(tag_size, activation = 'softmax')))
model.summary()

<b>Compile, train, and evaluate the model</b>

In [None]:
model.compile(loss = 'categorical_crossentropy',
             optimizer = Adam(0.001),
             metrics = ['accuracy'])
model.fit(x_train, y_train, batch_size = 128, epochs = 3, validation_data = (x_test, y_test))

In [None]:
model.evaluate(x_test, y_test)

<b>Prediction with the trained model: this needs dictionaries that map indices back to words, so use the tokenizers' own dictionaries</b>

In [None]:
idx2word = src_tokenizer.index_word
idx2ner = tar_tokenizer.index_word
idx2ner[0] = 'PAD'
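
The index-to-word dictionaries are simply the inverse of the tokenizers' word-to-index mappings, with index 0 added by hand because it is only used for padding. A minimal sketch with a hypothetical tag vocabulary (the Keras tokenizer lowercases text by default, which is why the tags are lowercase here and upper-cased again for display):

```python
# Hypothetical tag vocabulary in the shape of tar_tokenizer.word_index.
word_index = {'o': 1, 'b-per': 2, 'i-per': 3}

# Invert the mapping and register index 0 for the padding positions.
index_word = {index: word for word, index in word_index.items()}
index_word[0] = 'PAD'

predicted = [2, 3, 1, 0, 0]
decoded = [index_word[i].upper() for i in predicted]
print(decoded)
```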

<b>Visualizing predictions</b>

In [None]:
i = 10
y_predicted = model.predict(np.array([x_test[i]]))  # wrap in a batch of size 1
y_predicted = np.argmax(y_predicted, axis = -1)
true = np.argmax(y_test[i], -1)

print("{:15}|{:5}|{}".format("Word", "Actual", "Predicted"))
print("-" * 34)

for w, t, pred in zip(x_test[i], true, y_predicted[0]):
    if w != 0:
        print("{:17}: {:7} {}".format(idx2word[w], idx2ner[t].upper(), idx2ner[pred].upper()))
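
The `np.argmax(..., axis = -1)` calls above collapse the per-token class probabilities into one tag index per token. A minimal pure-Python sketch of that step, using made-up softmax values for a three-token sentence with four tag classes:

```python
# Hypothetical softmax output for one sentence: 3 tokens, 4 tag classes.
probabilities = [
    [0.1, 0.7, 0.1, 0.1],
    [0.0, 0.2, 0.6, 0.2],
    [0.9, 0.05, 0.03, 0.02],
]

# Pick the most probable class per token (argmax over the last axis).
predicted = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
print(predicted)
```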