# Lab11-1


# Text Generation 
The example below shows how to encode two sentences, 'You are the Best' and 'You are the Nice', at the word level using TensorFlow.

### 1. Word-based encoding

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

In [2]:
sentences = [
  'You are the Best',
  'You are the Nice'
]

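- The fit_on_texts() method takes text data and builds the vocabulary, assigning each word an integer index based on how often it occurs.
- The tokenizer's word_index attribute returns a dictionary of word-to-integer key-value pairs.
- The output shows that uppercase letters are converted to lowercase; for example, 'You' becomes 'you'.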

In [8]:
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

Words that were not seen during fitting are mapped to the out-of-vocabulary token "<OOV>", which is assigned index 1.

In [9]:
print(word_index)
print('----------------------------------------')
total_words = len(tokenizer.word_index) + 1
print('total_words=',total_words)

{'<OOV>': 1, 'you': 2, 'are': 3, 'the': 4, 'best': 5, 'nice': 6}
----------------------------------------
total_words= 7
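
To make the indexing rule concrete, here is a minimal pure-Python sketch of how such a word index can be built. The helper `build_word_index` is hypothetical and written only for illustration; it is not the Keras implementation, and it assumes plain whitespace splitting (the real Tokenizer also filters punctuation).

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Simplified, hypothetical stand-in for Tokenizer.fit_on_texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and split on whitespace only (assumption: no punctuation).
        counts.update(text.lower().split())
    # Most frequent words first; ties keep the order in which words first appeared.
    ordered = [word for word, _ in counts.most_common()]
    # Index 0 is left unused (it is reserved for padding); the OOV token takes index 1.
    vocab = [oov_token] + ordered
    return {word: i for i, word in enumerate(vocab, start=1)}

sentences = ['You are the Best', 'You are the Nice']
word_index = build_word_index(sentences)
print(word_index)
print('total_words=', len(word_index) + 1)
```

The `+ 1` in `total_words` accounts for index 0, which no word uses.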


### 2. Converting text into a sequence
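texts_to_sequences() converts each sentence into a sequence of integers by replacing every word with its index in word_index.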

In [11]:
sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)

{'<OOV>': 1, 'you': 2, 'are': 3, 'the': 4, 'best': 5, 'nice': 6}
[[2, 3, 4, 5], [2, 3, 4, 6]]
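
The conversion is essentially a dictionary lookup per word. A rough sketch, assuming every word is already in the vocabulary and that splitting on whitespace is enough (the function name `to_sequences` is made up for this illustration):

```python
def to_sequences(texts, word_index):
    """Replace each lowercase word with its integer index (illustrative sketch)."""
    return [[word_index[word] for word in text.lower().split()] for text in texts]

# Same vocabulary as printed above.
word_index = {'<OOV>': 1, 'you': 2, 'are': 3, 'the': 4, 'best': 5, 'nice': 6}
sequences = to_sequences(['You are the Best', 'You are the Nice'], word_index)
print(sequences)
```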


### 3. Setting up padding

Sentences must be padded so that they all have the same length.
- Padding is done with the pad_sequences function.

In [12]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

The sequences are the text sentences converted into lists of integers.
- pad_sequences pads every sequence to the length of the longest one. Here both sequences already have length 4, so no padding is added and the output is unchanged.

In [18]:
padded = pad_sequences(sequences)

print(word_index)
print(sequences)
print(padded)

{'<OOV>': 1, 'you': 2, 'are': 3, 'the': 4, 'best': 5, 'nice': 6}
[[2, 3, 4, 5], [2, 3, 4, 6]]
[[2 3 4 5]
 [2 3 4 6]]


#### padding parameter : 'pre', 'post'
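If the padding parameter is set to 'post', padding is added after the sequence. The default is 'pre', which adds it before the sequence. Because both sequences here already have the same length, the output is identical in either case.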

In [14]:
padded = pad_sequences(sequences, padding='post')
print(padded)

[[2 3 4 5]
 [2 3 4 6]]
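
Since the two example sentences have equal length, the effect of padding is not visible in the outputs above. The sketch below uses a hypothetical shorter second sequence (`[2, 3]`, i.e. 'You are') and a hand-written `pad` helper that imitates only the 'pre'/'post' behaviour; it is an illustration under those assumptions, not the Keras function.

```python
def pad(sequences, padding='pre', value=0):
    """Pad every sequence with `value` up to the length of the longest one."""
    maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill before the sequence, 'post' puts it after.
        padded.append(fill + list(seq) if padding == 'pre' else list(seq) + fill)
    return padded

# Hypothetical input: the second sequence is shorter than the first.
uneven = [[2, 3, 4, 5], [2, 3]]
pre_padded = pad(uneven, padding='pre')
post_padded = pad(uneven, padding='post')
print(pre_padded)   # [[2, 3, 4, 5], [0, 0, 2, 3]]
print(post_padded)  # [[2, 3, 4, 5], [2, 3, 0, 0]]
```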


### 4. Encoding in binary form

In [15]:
# Encode in binary form.
binary_results = tokenizer.sequences_to_matrix(sequences, mode = 'binary')
print(f'binary_vectors:\n {binary_results}\n')

binary_vectors:
 [[0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]
 [0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]]
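
Each row above has 100 columns because the tokenizer was created with num_words = 100; column i is 1 when word index i occurs in the sentence. The following pure-Python sketch reproduces the idea with only 7 columns so the rows are easier to read. The helper name `to_binary_matrix` is an assumption made for this example.

```python
def to_binary_matrix(sequences, num_words):
    """Build one multi-hot row per sequence (illustrative sketch)."""
    matrix = []
    for seq in sequences:
        row = [0.0] * num_words
        for idx in seq:
            # Indices that do not fit in the row are skipped.
            if idx < num_words:
                row[idx] = 1.0
        matrix.append(row)
    return matrix

sequences = [[2, 3, 4, 5], [2, 3, 4, 6]]
binary = to_binary_matrix(sequences, num_words=7)
for row in binary:
    print(row)
```

Unlike the integer sequences, this representation discards word order: it only records which words are present.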



In [16]:
print('One-Hot Encoding:', to_categorical(sequences))

One-Hot Encoding: [[[0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0.]]

 [[0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 0. 0. 0. 1.]]]
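
The result has shape (2, 4, 7): two sentences, four words each, and seven classes, because the number of classes is inferred as the largest index plus one. A small sketch of the same transformation in plain Python, with a hypothetical `one_hot` helper standing in for to_categorical:

```python
def one_hot(sequences, num_classes=None):
    """Turn each integer into a vector with a single 1.0 (illustrative sketch)."""
    if num_classes is None:
        # Infer the vector length from the largest index, as in the output above.
        num_classes = max(max(seq) for seq in sequences) + 1
    return [
        [[1.0 if i == idx else 0.0 for i in range(num_classes)] for idx in seq]
        for seq in sequences
    ]

sequences = [[2, 3, 4, 5], [2, 3, 4, 6]]
encoded = one_hot(sequences)
print(len(encoded), len(encoded[0]), len(encoded[0][0]))  # 2 4 7
print(encoded[0][0])
```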


In [17]:
test_text = ['You are the One']
test_seq = tokenizer.texts_to_sequences(test_text)

print(f'test sequences: {test_seq}')

test sequences: [[2, 3, 4, 1]]
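
The word 'One' was never seen during fitting, so it is replaced by index 1, the "<OOV>" token. The sketch below contrasts this with a tokenizer configured without an OOV token, in which case unknown words are dropped. The `encode` helper and the second vocabulary are assumptions written for this illustration only.

```python
def encode(text, word_index, oov_token=None):
    """Look up each word; fall back to the OOV index or drop the word."""
    ids = []
    for word in text.lower().split():
        if word in word_index:
            ids.append(word_index[word])
        elif oov_token is not None:
            ids.append(word_index[oov_token])
        # With no OOV token, the unknown word is skipped entirely.
    return ids

with_oov_index = {'<OOV>': 1, 'you': 2, 'are': 3, 'the': 4, 'best': 5, 'nice': 6}
# Hypothetical vocabulary for the same sentences when no OOV token is set.
no_oov_index = {'you': 1, 'are': 2, 'the': 3, 'best': 4, 'nice': 5}

with_oov = encode('You are the One', with_oov_index, oov_token='<OOV>')
without_oov = encode('You are the One', no_oov_index)
print(with_oov)     # [2, 3, 4, 1]
print(without_oov)  # [1, 2, 3]
```

Keeping an OOV token preserves the sentence length, which matters when the position of each word is used later.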
