# Embedding
*   The result, or the process, of converting human language (natural language) into vectors, the numeric form a computer can work with.


## Role
*   Computing how related words and sentences are
*   Encoding semantic and syntactic information

## Sparse-representation-based embedding
*   A representation in which most of the values are 0.

### One-hot encoding
>   Converts the given text into numbers: each word becomes a vector with a 1 at that word's index and 0 everywhere else.
<img src="https://heung-bae-lee.github.io/image/why_is_sparse_matirx_one_hot_encoding.png" width="600" height="500"/>

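Drawbacks:
*   A one-hot vector is sparse: exactly one element is 1 and all the others are 0.
*   Representing a single word takes as many dimensions as there are words in the corpus, so the vectors can become very high-dimensional.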
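
Both drawbacks can be seen in a minimal NumPy sketch. The four-word vocabulary and the `one_hot` helper are hypothetical, chosen only for illustration. Each vector is as long as the vocabulary, and the dot product of two different words is always 0, so one-hot vectors say nothing about how similar two words are.

```python
import numpy as np

# Hypothetical toy vocabulary, used only for illustration.
vocab = ["cat", "dog", "sofa", "slept"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length len(vocab) holding a single 1."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

cat, dog = one_hot("cat"), one_hot("dog")
print(cat)        # one entry is 1, every other entry is 0
print(cat @ dog)  # dot product of two different words
```

With a realistic vocabulary of tens of thousands of words, every vector would have that many entries while still holding a single 1.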

In [None]:
import pandas as pd
class2 = pd.read_csv("/content/drive/MyDrive/DL_example/딥러닝 텐서플로 교과서/data/class2.csv")

from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
onehot_encoder = preprocessing.OneHotEncoder()

train_x = label_encoder.fit_transform(class2['class2'])
train_x

array([2, 2, 1, 0, 1, 0])

In [None]:
pd.get_dummies(class2['class2'])

   B  I  N
0  0  0  1
1  0  0  1
2  0  1  0
3  1  0  0
4  0  1  0
5  1  0  0


## Count-based embedding
*   Embeds words by taking into account how often each word occurs.
<img src="https://www.educative.io/api/edpresso/shot/5197621598617600/image/6596233398321152" width="600" height="500"/>

### CountVectorizer()
>   Tokenizes the words in a collection of documents and encodes each document as a vector of **per-word occurrence counts**.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is last chance.',
    'and if you do not have this chance.',
    'you will never get any chance.',
    'will you do get this one?',
    'please, get this chance',
]
vect = CountVectorizer()
vect.fit(corpus)
# Inspect vocabulary_ to see which word each column of the document-term matrix represents
vect.vocabulary_

{'and': 0,
 'any': 1,
 'chance': 2,
 'do': 3,
 'get': 4,
 'have': 5,
 'if': 6,
 'is': 7,
 'last': 8,
 'never': 9,
 'not': 10,
 'one': 11,
 'please': 12,
 'this': 13,
 'will': 14,
 'you': 15}

In [None]:
# Apply CountVectorizer() and convert the result to an array.
vect.transform(['you will never get any chance.']).toarray()

array([[0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1]])

Building a count vector with stop words removed

In [None]:
vect = CountVectorizer(stop_words=["and", "is", "please", "this"]).fit(corpus)
vect.vocabulary_

{'any': 0,
 'chance': 1,
 'do': 2,
 'get': 3,
 'have': 4,
 'if': 5,
 'last': 6,
 'never': 7,
 'not': 8,
 'one': 9,
 'will': 10,
 'you': 11}

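## TF-IDF
*   TF (term frequency): how often a given word occurs within a document.
*   IDF (inverse document frequency): DF (document frequency) counts how many documents in the collection contain a word. Words that occur in almost every document, such as "a" or "the", carry little information, so their TF-IDF weight should be lowered. The larger the DF, the smaller the weight should be, which is why the inverse of DF is taken; that value is the IDF. In practice the logarithm of the inverse is used so that the weight does not grow too quickly.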
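
The two definitions can be sketched in plain Python. This is a hand-rolled illustration, not the library's implementation. It assumes the smoothed formula `ln((1 + N) / (1 + DF)) + 1`, which scikit-learn documents as its default, and a simplified tokenizer that lowercases, splits on spaces, and drops one-character tokens. The helper names `idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# The same three documents that the TfidfVectorizer cell in this section uses.
docs = ["I like machine learning", "I love deep learning", "I run everyday"]

# Simplified tokenizer (an assumption): lowercase, split on spaces,
# and drop one-character tokens such as "i".
tokenized = [[t for t in d.lower().split() if len(t) > 1] for d in docs]

n_docs = len(tokenized)
# DF: the number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed IDF (assumed variant): ln((1 + N) / (1 + DF)) + 1.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    # TF is the raw count of the term inside this document.
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(tokenized[0])
for term, weight in weights.items():
    print(term, round(weight, 4))
```

"learning" appears in two of the three documents, so it receives a lower weight than "machine", which appears in only one.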

<img src="https://class101.dev/images/thumbnails/tf-idf.png" width="600" height="500"/>

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
doc = ['I like machine learning', 'I love deep learning', 'I run everyday']
tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_matrix = tfidf_vectorizer.fit_transform(doc)
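# TfidfVectorizer L2-normalizes each row by default, so this product is the
# cosine similarity between documents (a similarity, not a distance).
doc_distance = (tfidf_matrix * tfidf_matrix.T)
print('Built a', str(doc_distance.get_shape()[0]), 'x', str(doc_distance.get_shape()[1]), 'matrix for similarity.')
print(doc_distance.toarray())

Built a 3 x 3 matrix for similarity.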
[[1.       0.224325 0.      ]
 [0.224325 1.       0.      ]
 [0.       0.       1.      ]]
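
The off-diagonal value in the matrix above can be checked by hand. The sketch below assumes the smoothed IDF `ln((1 + N) / (1 + DF)) + 1` and that the one-character token "I" is dropped, so the first two documents share only the word "learning". The vocabulary order and the `cosine` helper are hypothetical.

```python
import math

n_docs = 3
idf_shared = math.log((1 + n_docs) / (1 + 2)) + 1  # "learning": found in 2 documents
idf_unique = math.log((1 + n_docs) / (1 + 1)) + 1  # words found in 1 document

# Document 1 = like, machine, learning; document 2 = love, deep, learning.
# Vectors over the vocabulary [like, machine, love, deep, learning]:
doc1 = [idf_unique, idf_unique, 0.0, 0.0, idf_shared]
doc2 = [0.0, 0.0, idf_unique, idf_unique, idf_shared]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine(doc1, doc2), 6))
```

The result agrees with the 0.224325 entry in the matrix. The third document shares no word with the other two, which is why its off-diagonal entries are 0.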


In [None]:
doc_distance.get_shape()

(3, 3)

## Prediction-based embedding
*   Builds word vectors by training a neural network or other model to predict which word appears in a given context (e.g. Word2Vec).

### Word2Vec
*   A neural-network algorithm that takes a text and outputs one vector for each word in it.
<img src="https://thebook.io/img/080289/548.jpg" width="600" height="500"/>

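How it works

*   The text is split into windows of a fixed size, which are used as the network input.
*   Each window is fed to the network as a pair of a target word and its context.
*   The hidden layer of the network holds the weights for each word; after training, these weights are used as the word vectors.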
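
The windowing step can be sketched as follows. `context_windows` is a hypothetical helper written for this note, not part of gensim, and the window size of 2 is arbitrary.

```python
def context_windows(tokens, window=2):
    """Yield (target, context) pairs; context holds up to `window`
    words on each side of the target."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

sentence = "calm cat slept on the sofa".split()
pairs = list(context_windows(sentence, window=2))
for target, context in pairs:
    print(target, context)
```

Words near the start or end of the sentence simply get a shorter context, which is one common way to handle the boundaries.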


In [None]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import nltk
nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

True

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize 
import warnings 
warnings.filterwarnings(action = 'ignore') 
import gensim
from gensim.models import Word2Vec 

file_path = "/content/drive/MyDrive/DL_example/딥러닝 텐서플로 교과서/data/peter.txt"

with open(file_path, "r", encoding='UTF8') as sample:
    s = sample.read()

f = s.replace("\n", " ")
data = [] 

# Tokenize into sentences, then into lowercase words
for i in sent_tokenize(f):
    temp = [] 
    for j in word_tokenize(i):
        temp.append(j.lower())
    data.append(temp) 

data[:5]

[['once',
  'upon',
  'a',
  'time',
  'in',
  'london',
  ',',
  'the',
  'darlings',
  'went',
  'out',
  'to',
  'a',
  'dinner',
  'party',
  'leaving',
  'their',
  'three',
  'children',
  'wendy',
  ',',
  'jhon',
  ',',
  'and',
  'michael',
  'at',
  'home',
  '.'],
 ['after',
  'wendy',
  'had',
  'tucked',
  'her',
  'younger',
  'brothers',
  'jhon',
  'and',
  'michael',
  'to',
  'bed',
  ',',
  'she',
  'went',
  'to',
  'read',
  'a',
  'book',
  '.'],
 ['she', 'heard', 'a', 'boy', 'sobbing', 'outside', 'her', 'window', '.'],
 ['he', 'was', 'flying', '.'],
 ['there', 'was', 'little', 'fairy', 'fluttering', 'around', 'him', '.']]

## CBOW
>   Predicts a word from the surrounding words that are given as context.
ex) Given the sentence "calm cat slept on the sofa", the context "calm cat on the sofa" is supplied and the model predicts "slept".
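
A single CBOW forward pass for that example might look like the sketch below. It is a simplified illustration with random, untrained weights, so the predicted probabilities are meaningless until the matrices are trained. The names `W_in`, `W_out`, and `cbow_probabilities` and the embedding size of 4 are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["calm", "cat", "slept", "on", "the", "sofa"]
index = {word: i for i, word in enumerate(vocab)}
dim = 4  # embedding size, chosen arbitrarily for the sketch

W_in = rng.normal(size=(len(vocab), dim))   # input (embedding) matrix
W_out = rng.normal(size=(dim, len(vocab)))  # output projection

def cbow_probabilities(context_words):
    # Average the context embeddings into one hidden vector.
    hidden = W_in[[index[w] for w in context_words]].mean(axis=0)
    scores = hidden @ W_out
    exp = np.exp(scores - scores.max())     # numerically stable softmax
    return exp / exp.sum()

probs = cbow_probabilities(["calm", "cat", "on", "the", "sofa"])
print(dict(zip(vocab, probs.round(3))))
```

Training would adjust both matrices so that the probability assigned to "slept" rises for this context; the rows of `W_in` are what end up being used as word vectors.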

<img src="https://img1.daumcdn.net/thumb/R800x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FphEkS%2FbtqXSoISyn9%2FeI2vpCZ8svhF7X4U3JCTx0%2Fimg.png" width="600" height="500"/>

In [None]:
# gensim 3.x API: `size` was renamed `vector_size` in gensim 4.0.
model1 = gensim.models.Word2Vec(data, min_count=1, size=100, window=5, sg=0)
print("Cosine similarity between 'peter' " + "'wendy' - CBOW :",
            model1.wv.similarity('peter', 'wendy'))

Cosine similarity between 'peter' 'wendy' - CBOW : -0.05985912


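*   data: the dataset CBOW is applied to
*   min_count: minimum word frequency (words that occur less often are not trained)
*   size: the number of features in a word vector, i.e. the dimensionality of the embedding (renamed `vector_size` in gensim 4.0 and later)
*   window: the context window size
*   sg: default = 0; 0 selects CBOW, 1 selects skip-gram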

In [None]:
print("Cosine similarity between 'peter' " +"'hook' - CBOW : ", 
      model1.wv.similarity('peter', 'hook'))

Cosine similarity between 'peter' 'hook' - CBOW :  0.17440854


## skip-gram

>   The reverse of CBOW: given a single word, predicts the words that can appear in its context.
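
The training examples that skip-gram learns from can be sketched as (center, context) pairs. `skipgram_pairs` is a hypothetical helper, and the window of 1 is chosen only to keep the output short.

```python
def skipgram_pairs(tokens, window=2):
    """Return one (center, context_word) pair for every context word
    that lies within `window` positions of the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("calm cat slept on the sofa".split(), window=1)
print(pairs)
```

Each pair is a separate prediction task, so one center word produces several training examples, whereas CBOW merges the whole context into one example.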

In [None]:
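model2 = gensim.models.Word2Vec(data, min_count=1, size=100, window=5, sg=1)
# Query model2 (skip-gram). The output recorded below was produced with model1,
# which is why it repeats the CBOW value; rerun the cell for the skip-gram score.
print("Cosine similarity between 'peter' " + "'wendy' - Skip Gram :",
            model2.wv.similarity('peter', 'wendy'))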

Cosine similarity between 'peter' 'wendy' - Skip Gram : -0.05985912


In [None]:
print("Cosine similarity between 'peter' " +"'hook' - Skip Gram : ", 
      model2.wv.similarity('peter', 'hook'))

Cosine similarity between 'peter' 'hook' - Skip Gram :  0.5795201


## FastText

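*   An embedding algorithm developed to address the shortcomings of Word2Vec
*   It is robust to noise and, for a word it has never seen, produces a vector that reflects the word's morphological similarity to known words, because vectors are built from character n-grams.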
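
The morphological behaviour comes from splitting each word into character n-grams. The sketch below shows that decomposition only; `char_ngrams` is a hypothetical helper, and the `<` and `>` boundary markers and the 3 to 6 length range follow the commonly described FastText defaults, taken here as an assumption.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))

# An unseen word still shares n-grams with words seen during training.
shared = set(char_ngrams("peter")) & set(char_ngrams("peters"))
print(sorted(shared))
```

Because a word vector is composed from its n-gram vectors, a misspelled or new word that shares n-grams with known words still receives a usable vector.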

In [None]:
from tqdm import tqdm
from gensim.test.utils import common_texts
from gensim.models import FastText

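corpus_fname = file_path
# Build one whitespace-split token list per line of the file.
with open(corpus_fname, 'r', encoding='utf-8') as corpus_file:
    corpus = [sent.strip().split(" ") for sent in tqdm(corpus_file.readlines())]
# gensim 3.x API: `size` and `iter` were renamed `vector_size` and `epochs` in gensim 4.0.
model = FastText(corpus, size=4, window=3, min_count=1, iter=10)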

100%|██████████| 41/41 [00:00<00:00, 27274.62it/s]


In [None]:
sim_score = model.wv.similarity('peter', 'wendy')
print(sim_score)

-0.5548141


In [None]:
sim_score = model.wv.similarity('peter', 'hook')
print(sim_score)

-0.1171107


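## Count / prediction-based embedding (GloVe)
*   A model that addresses the shortcomings of both count-based LSA and prediction-based Word2Vec.
*   A word embedding method that incorporates global co-occurrence probabilities of words.
*   It combines corpus-wide word statistics with the skip-gram approach.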
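
The global statistic GloVe starts from is a word-word co-occurrence table. The sketch below builds one with plain counts; it is a simplification, since GloVe as usually described also down-weights distant pairs and then fits vectors whose dot products approximate the logarithm of these counts. `cooccurrence_counts` and the two toy sentences are hypothetical.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` positions of each other."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the left and record both directions (symmetric counts).
            for j in range(max(0, i - window), i):
                counts[(word, tokens[j])] += 1.0
                counts[(tokens[j], word)] += 1.0
    return counts

# Hypothetical toy corpus.
sentences = [["i", "like", "deep", "learning"], ["i", "like", "nlp"]]
counts = cooccurrence_counts(sentences, window=1)
print(counts[("i", "like")], counts[("deep", "learning")])
```

Unlike Word2Vec, which sees one window at a time, this table summarizes the whole corpus before any vectors are trained.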


In [None]:
import numpy as np
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn.decomposition import PCA
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove_file = datapath('/content/drive/MyDrive/DL_example/딥러닝 텐서플로 교과서/data/glove.6B.100d.txt')
word2vec_glove_file = get_tmpfile("glove.6B.100d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)

(400000, 100)

glove2word2vec
*   This function converts GloVe vectors into the Word2Vec format.
*   First argument: the GloVe input file
*   Second argument: the Word2Vec output file

In [None]:
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)
# Return the list of words most similar to 'bill'
model.most_similar('bill')

[('legislation', 0.8072140216827393),
 ('proposal', 0.7306863069534302),
 ('senate', 0.7142540812492371),
 ('bills', 0.7044401168823242),
 ('measure', 0.6958035230636597),
 ('passed', 0.6906244158744812),
 ('amendment', 0.6846879720687866),
 ('provision', 0.6845567226409912),
 ('plan', 0.6816462874412537),
 ('clinton', 0.6663139462471008)]

In [None]:
model.most_similar('cherry')

[('peach', 0.688809871673584),
 ('mango', 0.6838189959526062),
 ('plum', 0.6684104204177856),
 ('berry', 0.6590359210968018),
 ('grove', 0.6581551432609558),
 ('blossom', 0.6503506302833557),
 ('raspberry', 0.6477391719818115),
 ('strawberry', 0.6442098617553711),
 ('pine', 0.6390928626060486),
 ('almond', 0.6379213333129883)]

In [None]:
# Return words unrelated to 'cherry' (those closest to its negated vector)
model.most_similar(negative=['cherry'])

[('kazushige', 0.4834350347518921),
 ('askerov', 0.4778186082839966),
 ('lakpa', 0.46915262937545776),
 ('ex-gay', 0.45713329315185547),
 ('tadayoshi', 0.4522106647491455),
 ('turani', 0.4481006860733032),
 ('saglam', 0.446959912776947),
 ('aijun', 0.4435269832611084),
 ('adjustors', 0.44235295057296753),
 ('nyum', 0.4423118233680725)]

In [None]:
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

queen: 0.7699


In [None]:
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]
analogy('australia', 'beer', 'france')

'champagne'

In [None]:
model.most_similar(positive=['beer', 'france'], negative=['australia'])

[('champagne', 0.6480064988136292),
 ('wine', 0.6029773354530334),
 ('cognac', 0.599911093711853),
 ('drink', 0.596866250038147),
 ('perfume', 0.5843736529350281),
 ('drinks', 0.5787434577941895),
 ('vodka', 0.5771392583847046),
 ('beers', 0.5634331703186035),
 ('anheuser', 0.5613827705383301),
 ('bourbon', 0.552852988243103)]

In [None]:
analogy('tall', 'tallest', 'long')

'longest'

In [None]:
# Return the word that fits least with the others in the list
print(model.doesnt_match("breakfast cereal dinner lunch".split()))

cereal


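# Transformer Attention
*   Attention is used mainly in machine translation, so it is built on encoder and decoder networks.
*   The encoder converts the input into vectors and sends all of them to the decoder.
*   A weighted sum is computed with the softmax function and passed to the decoder.
*   For its current hidden state, the decoder uses softmax to score which encoder vectors it should focus on, then multiplies each score by the corresponding hidden-state vector.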
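
The scoring, softmax, and weighted-sum steps can be sketched with NumPy. This is a minimal dot-product variant with hypothetical toy vectors, not the additive attention layer used in the seq2seq code of this section.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - x.max())  # subtract the max for numerical stability
    return exp / exp.sum()

def dot_product_attention(decoder_state, encoder_states):
    # One score per encoder time step: similarity to the decoder state.
    scores = encoder_states @ decoder_state   # shape (T,)
    weights = softmax(scores)                 # attention weights, sum to 1
    context = weights @ encoder_states        # weighted sum, shape (d,)
    return context, weights

# Hypothetical encoder hidden states for three time steps (d = 2).
encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])

context, weights = dot_product_attention(decoder_state, encoder_states)
print(weights.round(3), context.round(3))
```

The first and third encoder states align with the decoder state, so they receive most of the weight and dominate the context vector.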

<img src="https://wikidocs.net/images/page/22893/dotproductattention1_final.PNG" width="600" height="500"/>

### seq2seq

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
import os
import io
import re
import time
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
import unicodedata

In [None]:
# Define the dataset preprocessing functions
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                        if unicodedata.category(c) != "Mn")

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    w = w.rstrip().strip()
    w = '<start> ' + w + ' <end>'
    return w

In [None]:
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))

<start> may i borrow this book ? <end>
b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'


In [None]:
def create_dataset(path, num_examples):
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
 
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')]  for l in lines[:num_examples]]
 
    return zip(*word_pairs)
def max_length(tensor):
    return max(len(t) for t in tensor)
 
def tokenize(lang):
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
  lang_tokenizer.fit_on_texts(lang)
 
  tensor = lang_tokenizer.texts_to_sequences(lang)
 
  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
 
  return tensor, lang_tokenizer
 
def load_dataset(path, num_examples=None):
    targ_lang, inp_lang = create_dataset(path, num_examples)
 
    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)
 
    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

In [None]:
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset('/content/drive/MyDrive/DL_example/딥러닝 텐서플로 교과서/data/spa.txt', num_examples)
 
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)
 
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

In [None]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1
 
dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [None]:
class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
 
  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state
 
  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))
 
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE) 

In [None]:
class EDAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(EDAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)
 
    def call(self, query, values):
        hidden_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(hidden_with_time_axis)))
 
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights 
attention_layer = EDAttention(10)
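
The tensor shapes inside `EDAttention.call` can be traced with a NumPy sketch of the same additive scoring. The random matrices stand in for the three `Dense` layers with their biases omitted, and the batch size, sequence length, and hidden size are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden, units = 2, 5, 8, 10

values = rng.normal(size=(batch, seq_len, hidden))  # encoder outputs
query = rng.normal(size=(batch, hidden))            # decoder hidden state

# Stand-ins for the W1, W2 and V Dense layers (biases omitted).
W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
V = rng.normal(size=(units, 1))

# query[:, None, :] adds the time axis so it broadcasts over seq_len.
score = np.tanh(values @ W1 + query[:, None, :] @ W2) @ V  # (batch, seq_len, 1)

# Softmax over the time axis (axis=1), as in the layer.
exp = np.exp(score - score.max(axis=1, keepdims=True))
attention_weights = exp / exp.sum(axis=1, keepdims=True)

# Weighted sum of the encoder outputs over time.
context_vector = (attention_weights * values).sum(axis=1)  # (batch, hidden)
print(score.shape, attention_weights.shape, context_vector.shape)
```

Taking the softmax over `axis=1` is what makes the weights for one sentence sum to 1 across its time steps.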

In [None]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = EDAttention(self.dec_units)
 
    def call(self, x, hidden, enc_output):
        context_vector, attention_weights = self.attention(hidden, enc_output)
        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        x = self.fc(output)
        return x, state, attention_weights
 
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

In [None]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
 
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
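
The effect of the padding mask can be checked with a small NumPy sketch. `masked_cross_entropy` is a hypothetical re-implementation of the same idea, not the TensorFlow code, and the labels and logits are toy values.

```python
import numpy as np

def masked_cross_entropy(real, logits):
    # Log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability of the true token at each position.
    per_token = -log_probs[np.arange(len(real)), real]
    # Token id 0 marks padding; those positions contribute no loss.
    mask = (real != 0).astype(per_token.dtype)
    return (per_token * mask).mean()

real = np.array([3, 1, 0, 0])   # two real tokens followed by two padding tokens
logits = np.zeros((4, 5))       # uniform prediction over a 5-word vocabulary
loss = masked_cross_entropy(real, logits)
print(loss)
```

As with `tf.reduce_mean` in the cell above, the mean is taken over all positions including the padded ones, so heavily padded batches report a smaller loss than the average over real tokens alone.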

In [None]:
checkpoint_dir = '/content/drive/MyDrive/DL_example/딥러닝 텐서플로 교과서/training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")

checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [None]:
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1)
  batch_loss = (loss / int(targ.shape[1]))
  variables = encoder.trainable_variables + decoder.trainable_variables
  gradients = tape.gradient(loss, variables)
  optimizer.apply_gradients(zip(gradients, variables)) 
  return batch_loss

In [None]:
EPOCHS = 10
 
for epoch in range(EPOCHS):
  start = time.time()
 
  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0
 
  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss
 
    if batch % 100 == 0:
        print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                     batch,
                                                     batch_loss.numpy()))
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix = checkpoint_prefix)
 
  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / steps_per_epoch))
print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 4.5663
Epoch 1 Batch 100 Loss 2.1583
Epoch 1 Batch 200 Loss 1.8494
Epoch 1 Batch 300 Loss 1.5967
Epoch 1 Loss 2.0099
Epoch 2 Batch 0 Loss 1.5570
Epoch 2 Batch 100 Loss 1.3037
Epoch 2 Batch 200 Loss 1.3536
Epoch 2 Batch 300 Loss 1.2132
Epoch 2 Loss 1.3353
Epoch 3 Batch 0 Loss 0.9554
Epoch 3 Batch 100 Loss 1.0212
Epoch 3 Batch 200 Loss 1.0636
Epoch 3 Batch 300 Loss 0.7478
Epoch 3 Loss 0.8928
Epoch 4 Batch 0 Loss 0.5662
Epoch 4 Batch 100 Loss 0.7125
Epoch 4 Batch 200 Loss 0.6115
Epoch 4 Batch 300 Loss 0.5927
Epoch 4 Loss 0.5928
Epoch 5 Batch 0 Loss 0.4097
Epoch 5 Batch 100 Loss 0.3655
Epoch 5 Batch 200 Loss 0.4150
Epoch 5 Batch 300 Loss 0.4341
Epoch 5 Loss 0.4025
Epoch 6 Batch 0 Loss 0.2421
Epoch 6 Batch 100 Loss 0.2794
Epoch 6 Batch 200 Loss 0.2486
Epoch 6 Batch 300 Loss 0.2565
Epoch 6 Loss 0.2775
Epoch 7 Batch 0 Loss 0.2108
Epoch 7 Batch 100 Loss 0.1845
Epoch 7 Batch 200 Loss 0.2701
Epoch 7 Batch 300 Loss 0.1977
Epoch 7 Loss 0.2012
Epoch 8 Batch 0 Loss 0.1029
Epoch 

In [None]:
def evaluate(sentence):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
 
    sentence = preprocess_sentence(sentence)
 
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                           maxlen=max_length_inp,
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)
    result = ''
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)
 
    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                             dec_hidden,
                                                             enc_out)
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()
        predicted_id = tf.argmax(predictions[0]).numpy()
        result += targ_lang.index_word[predicted_id] + ' '
        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence, attention_plot
        dec_input = tf.expand_dims([predicted_id], 0)
 
    return result, sentence, attention_plot

In [None]:
def plot_attention(attention, sentence, predicted_sentence):
  fig = plt.figure(figsize=(10,10))
  ax = fig.add_subplot(1, 1, 1)
  ax.matshow(attention, cmap='viridis')

  fontdict = {'fontsize': 14}

  ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
  ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

  ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
  ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

  plt.show()

In [None]:
def translate(sentence):
    result, sentence, attention_plot = evaluate(sentence)
 
    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))
 
    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    plot_attention(attention_plot, sentence.split(' '), result.split(' '))

checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

translate(u'esta es mi vida.')

Input: <start> esta es mi vida . <end>
Predicted translation: this is my life . <end> 



## BERT

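*   A model that processes a query sentence by understanding how every word in the sentence relates to the others.
*   It builds on transfer learning: a network that has already been pre-trained is fine-tuned for the task at hand.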

In [None]:
!pip install bert-for-tf2
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert-for-tf2
  Downloading bert-for-tf2-0.14.9.tar.gz (41 kB)
Collecting py-params>=0.9.6
  Downloading py-params-0.10.2.tar.gz (7.4 kB)
Collecting params-flow>=0.8.0
  Downloading params-flow-0.8.2.tar.gz (22 kB)
Building wheels for collected packages: bert-for-tf2, params-flow, py-params
  Building wheel for bert-for-tf2 (setup.py) ... done
  Created wheel for bert-for-tf2: filename=bert_for_tf2-0.14.9-py3-none-any.whl size=30535 sha256=09777146c5b28d3cb4936f1b4343b3cdd182083d052862b16061da3ebd2de870
  Stored in directory: /root/.cache/pip/wheels/47/b6/e5/8c76ec779f54bc5c2f1b57d2200bb9c77616da83873e8acb53
  Building wheel for params-flow (setup.py) ... done
  Created wheel for params-flow: filename=params_flow-0.8.2-py3-none-any.whl size=19472 sha256=8ecea868cdd984ec66c6ff0bd306c63bafb59f

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers
import bert
import pandas as pd
movie_reviews = pd.read_csv("/content/drive/MyDrive/DL_example/딥러닝 텐서플로 교과서/data/IMDB Dataset.csv")
movie_reviews.isnull().values.any()
movie_reviews.shape

(50000, 2)

In [None]:
def preprocess_text(sen):
    sentence = remove_tags(sen)
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence

TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
    return TAG_RE.sub('', text)

reviews = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    reviews.append(preprocess_text(sen))

print(movie_reviews.columns.values)

['review' 'sentiment']


In [None]:
movie_reviews.sentiment.unique()

array(['positive', 'negative'], dtype=object)

In [None]:
y = movie_reviews['sentiment']
y = np.array(list(map(lambda x : 1 if x=="positive" else 0, y)))

In [None]:
print(reviews[10])

Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines At first it was very odd and pretty funny but as the movie progressed didn find the jokes or oddness funny anymore Its low budget film thats never problem in itself there were some pretty interesting characters but eventually just lost interest imagine this film would appeal to stoner who is currently partaking For something similar but better try Brother from another planet 


In [None]:
print(y[10])

0


In [None]:
BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=False)
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)

In [None]:
tokenizer.tokenize("don't be so judgmental")

['don', "'", 't', 'be', 'so', 'judgment', '##al']
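
The split of "judgmental" into `judgment` and `##al` follows from greedy longest-match-first subword matching. The sketch below illustrates that idea with a hypothetical six-entry vocabulary; it is not the `bert` package's tokenizer, which also handles lowercasing, punctuation, and a length limit.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1                          # try a shorter piece
        if match is None:
            return [unk]                      # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, used only for illustration.
toy_vocab = {"judgment", "##al", "judge", "##ment", "be", "so"}
print(wordpiece("judgmental", toy_vocab))
```

Continuation pieces carry the `##` prefix, which is how the original word boundaries can be recovered from the token list.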

In [None]:
tokenizer.convert_tokens_to_ids(tokenizer.tokenize("don't be so judgmental"))

[2123, 1005, 1056, 2022, 2061, 8689, 2389]

In [None]:
def tokenize_reviews(text_reviews):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_reviews))
tokenized_reviews = [tokenize_reviews(review) for review in reviews]

In [None]:
import random

reviews_with_len = [[review, y[i], len(review)]
                 for i, review in enumerate(tokenized_reviews)]
random.shuffle(reviews_with_len)
reviews_with_len.sort(key=lambda x: x[2])
sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in reviews_with_len]
processed_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, tf.int32))
BATCH_SIZE = 32
batched_dataset = processed_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))
next(iter(batched_dataset))

(<tf.Tensor: shape=(32, 21), dtype=int32, numpy=
 array([[ 3078,  5436,  3078,  3257,  3532,  7613,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 3191,  1996,  2338,  5293,  1996,  3185,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2054,  5896,  2054,  2466,  2054,  6752,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2062, 23873,  3993,  2062, 11259,  2172,  2172,  2062, 14888,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 1045,  2876,  9278,  2023,  2028,  2130,  2006,  7922, 12635,
          2305,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2023,  3185,  2003,  6659,  2021,  2009,  2038,  2070,  2204,
    

In [None]:
import math

TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
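# Dataset.shuffle returns a new dataset, so keep the result; a fixed order
# across iterations keeps the test and training batches disjoint.
batched_dataset = batched_dataset.shuffle(TOTAL_BATCHES, reshuffle_each_iteration=False)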
test_data = batched_dataset.take(TEST_BATCHES)
train_data = batched_dataset.skip(TEST_BATCHES)

In [None]:
class TEXT_MODEL(tf.keras.Model):
    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)
        self.embedding = tf.keras.layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = tf.keras.layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = tf.keras.layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = tf.keras.layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")
        self.pool = tf.keras.layers.GlobalMaxPool1D()
        self.dense_1 = tf.keras.layers.Dense(units=dnn_units, activation="relu")
        self.dropout = tf.keras.layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = tf.keras.layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = tf.keras.layers.Dense(units=model_output_classes,
                                           activation="softmax")
    
    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l) 
        l_1 = self.pool(l_1) 
        l_2 = self.cnn_layer2(l) 
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3)
        
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) 
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        return model_output

In [None]:
VOCAB_LENGTH = len(tokenizer.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2
DROPOUT_RATE = 0.2
NB_EPOCHS = 5

In [None]:
text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=DROPOUT_RATE)

In [None]:
if OUTPUT_CLASSES == 2:
    text_model.compile(loss="binary_crossentropy",
                       optimizer="adam",
                       metrics=["accuracy"])
else:
    text_model.compile(loss="sparse_categorical_crossentropy",
                       optimizer="adam",
                       metrics=["sparse_categorical_accuracy"])

text_model.fit(train_data, epochs=NB_EPOCHS)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f1c73519910>

In [None]:
results = text_model.evaluate(test_data)
print(results)

[0.4139547049999237, 0.8996394276618958]
