<a href="https://colab.research.google.com/github/pelinbalci/NLP_TF/blob/main/lecture_1_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Ref: https://www.youtube.com/watch?v=f5YJA5mQD5c&ab_channel=GoogleDevelopers

A common character encoding is ASCII.

Common letters and symbols are encoded as values from 0 to 255 (standard ASCII covers 0 to 127, and 8-bit extensions use the remaining values).

It is convenient because a single byte is enough to store the value of a letter. It had to be superseded by later encodings in order to give access to characters beyond 255, in particular international characters.


LISTEN: 076, 073, 083, 084, 069, 078 -> 6 bytes
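
As a quick check of the numbers above, here is a minimal Python sketch that reproduces the codes for LISTEN with the built-in ord function. The variable names are illustrative only.

```python
# Look up the character code of each letter in the word.
word = "LISTEN"
codes = [ord(ch) for ch in word]
print(codes)

# Encoding to ASCII stores one byte per character.
encoded = word.encode("ascii")
print(len(encoded), "bytes")
```

The list matches the values written above without the leading zeros, and the encoded word takes 6 bytes.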

Here we will learn word-based encoding rather than letter-based encoding.

LISTEN and SILENT have the same letters and opposite meanings :)

With letter-based encoding, a computer cannot tell the difference between these two words unless we use a sequence model (which is a bit complicated).
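
To make this concrete, here is a small sketch in plain Python (the names are illustrative). It shows that the two words contain exactly the same character codes, only in a different order.

```python
listen_codes = [ord(ch) for ch in "LISTEN"]
silent_codes = [ord(ch) for ch in "SILENT"]

# Ignoring order, both words are made of the same codes...
same_letters = sorted(listen_codes) == sorted(silent_codes)

# ...but the sequences themselves are different.
same_order = listen_codes == silent_codes

print(same_letters, same_order)
```

Any approach that only looks at which letters occur, and not at their order, would treat the two words as identical.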


**Word-Based Encoding**

Each word is represented by a single number.




In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# num_words specifies the maximum number of words to keep.
# Tokens are assigned to words based on how frequently they occur in the corpus.
# The most common word gets index 1.
# Punctuation such as ! is removed automatically :)

tokenizer = Tokenizer(num_words = 100) 
tokenizer.fit_on_texts(sentences) 
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
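
The frequency-based ordering in the output above can be approximated without TensorFlow. The sketch below is not the Tokenizer's actual implementation. build_word_index is a hypothetical helper, and splitting on the regular expression \w+ is a simplification of the Tokenizer's punctuation filter.

```python
import re


def build_word_index(texts):
    """Rough stand-in for fit_on_texts: lowercase, drop punctuation, rank by count."""
    counts = {}
    for text in texts:
        # Keep only runs of letters and digits, which drops ',' and '!'.
        for word in re.findall(r"\w+", text.lower()):
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Indices start at 1; the most frequent word gets index 1.
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}


sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]
word_index = build_word_index(sentences)
print(word_index)
```

For these three sentences the sketch gives the same mapping as the Tokenizer output above. Other inputs, for example words containing apostrophes, may be split differently.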


In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'merhaba, benim adım pelin',
    'benim kedim var',
    'sen neredesin?',
    'sen buraya gel.',
    'benim evim nerede nerede nerede',
    'kedim',
    'adım xxx',
    'Sen',
    'sen',
    'Sen',
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

counts = tokenizer.word_counts
print(counts)

texttoseq = tokenizer.texts_to_sequences(sentences)
print(texttoseq)

{'sen': 1, 'benim': 2, 'nerede': 3, 'adım': 4, 'kedim': 5, 'merhaba': 6, 'pelin': 7, 'var': 8, 'neredesin': 9, 'buraya': 10, 'gel': 11, 'evim': 12, 'xxx': 13}
OrderedDict([('merhaba', 1), ('benim', 3), ('adım', 2), ('pelin', 1), ('kedim', 2), ('var', 1), ('sen', 5), ('neredesin', 1), ('buraya', 1), ('gel', 1), ('evim', 1), ('nerede', 3), ('xxx', 1)])
[[6, 2, 4, 7], [2, 5, 8], [1, 9], [1, 10, 11], [2, 12, 3, 3, 3], [5], [4, 13], [1], [1], [1]]


In [4]:
test_data = [
             'adım nerede yazıyor?',
             'benim kedim nerede',
             'sen evde misin'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

# In the first sentence, yazıyor is missing.
# In the third sentence, evde and misin are missing, since they are not in the word index.

[[4, 3], [2, 5, 3], [1]]


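# How to deal with unseen words?

Define an OOV (out-of-vocabulary) token in the Tokenizer with oov_token. Unseen words are then mapped to this token's index instead of being dropped.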


In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'merhaba, benim adım pelin',
    'benim kedim var',
    'sen neredesin?',
    'sen buraya gel.',
    'benim evim nerede nerede nerede',
    'kedim',
    'adım xxx',
    'Sen',
    'sen',
    'Sen',
]

test_data = [
             'adım nerede yazıyor?',
             'benim kedim nerede',
             'sen evde misin'
]

# create tokenizer
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

# fit data
tokenizer.fit_on_texts(sentences)

# get word index
word_index = tokenizer.word_index
print(word_index)

# get index of words in each sentence
sequence = tokenizer.texts_to_sequences(sentences)
print(sequence)

# get index of words in test_data
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

{'<OOV>': 1, 'sen': 2, 'benim': 3, 'nerede': 4, 'adım': 5, 'kedim': 6, 'merhaba': 7, 'pelin': 8, 'var': 9, 'neredesin': 10, 'buraya': 11, 'gel': 12, 'evim': 13, 'xxx': 14}
[[7, 3, 5, 8], [3, 6, 9], [2, 10], [2, 11, 12], [3, 13, 4, 4, 4], [6], [5, 14], [2], [2], [2]]
[[5, 4, 1], [3, 6, 4], [2, 1, 1]]
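
The lookup behind this output can be sketched in plain Python. This is an approximation rather than the Tokenizer's real code. to_sequences is a hypothetical helper, the word index is copied from the printed output, and the \w+ split is a simplified punctuation filter.

```python
import re

# Word index copied from the printed output above.
word_index = {'<OOV>': 1, 'sen': 2, 'benim': 3, 'nerede': 4, 'adım': 5,
              'kedim': 6, 'merhaba': 7, 'pelin': 8, 'var': 9, 'neredesin': 10,
              'buraya': 11, 'gel': 12, 'evim': 13, 'xxx': 14}


def to_sequences(texts, word_index, oov_token=None):
    """Map each word to its index; unknown words become the OOV index or are dropped."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        sequence = []
        for word in re.findall(r"\w+", text.lower()):
            index = word_index.get(word, oov_index)
            if index is not None:  # without an OOV token, unseen words are skipped
                sequence.append(index)
        sequences.append(sequence)
    return sequences


test_data = [
    'adım nerede yazıyor?',
    'benim kedim nerede',
    'sen evde misin'
]
with_oov = to_sequences(test_data, word_index, oov_token='<OOV>')
without_oov = to_sequences(test_data, word_index)
print(with_oov)
print(without_oov)
```

With the OOV token, every test sentence keeps its original length. Without it, the unseen words yazıyor, evde and misin simply disappear from the sequences.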


# Padding

Before training, the sequences need a uniform size.

Padding makes every row the same length.

By default the padding zeros are placed at the start of each row. To place them at the end, add padding='post'. You can also limit the maximum length, for example with maxlen=5. If a sentence has more than 5 words, you lose information from the beginning of the sentence by default. You can change this with truncating='post', which drops words from the end instead.
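
The padding and truncating rules can be illustrated with a minimal pure-Python sketch. This is not the pad_sequences implementation. pad is a hypothetical helper that returns plain lists instead of a NumPy array, and the input sequences are hand-written for illustration.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        # Without maxlen, use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the start, 'post' drops them from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the zeros in front, 'post' puts them at the back.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[7, 3, 5, 8], [2, 10], [3, 13, 4, 4, 4, 14]]

basic = pad(sequences)
post_max = pad(sequences, padding="post", maxlen=4)
post_max_post = pad(sequences, padding="post", truncating="post", maxlen=4)
print(basic)
print(post_max)
print(post_max_post)
```

In the last two calls the six-word row is cut to four items. With the default truncating it keeps the last four, and with truncating='post' it keeps the first four.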

In [12]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'merhaba, benim adım pelin',
    'benim kedim var',
    'sen neredesin?',
    'sen buraya gel.',
    'benim evim nerede nerede nerede biliyor',
    'kedim',
    'adım xxx',
    'Sen',
    'sen',
    'Sen',
]

# create tokenizer
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

# fit data
tokenizer.fit_on_texts(sentences)

# get word index
word_index = tokenizer.word_index
print(word_index)

# get index of words in each sentence
sequence = tokenizer.texts_to_sequences(sentences)
print(sequence)

# padding
print('basic padded')
padded = pad_sequences(sequence)
print(padded)

print('padded post')
padded_post = pad_sequences(sequence, padding='post')
print(padded_post)

print('padded post, maxlen')
padded_post_max = pad_sequences(sequence, padding='post', maxlen=4)
print(padded_post_max)

print('padded post, maxlen post')
padded_post_max_post = pad_sequences(sequence, padding='post', truncating='post', maxlen=4)
print(padded_post_max_post)

{'<OOV>': 1, 'sen': 2, 'benim': 3, 'nerede': 4, 'adım': 5, 'kedim': 6, 'merhaba': 7, 'pelin': 8, 'var': 9, 'neredesin': 10, 'buraya': 11, 'gel': 12, 'evim': 13, 'biliyor': 14, 'xxx': 15}
[[7, 3, 5, 8], [3, 6, 9], [2, 10], [2, 11, 12], [3, 13, 4, 4, 4, 14], [6], [5, 15], [2], [2], [2]]
basic padded
[[ 0  0  7  3  5  8]
 [ 0  0  0  3  6  9]
 [ 0  0  0  0  2 10]
 [ 0  0  0  2 11 12]
 [ 3 13  4  4  4 14]
 [ 0  0  0  0  0  6]
 [ 0  0  0  0  5 15]
 [ 0  0  0  0  0  2]
 [ 0  0  0  0  0  2]
 [ 0  0  0  0  0  2]]
padded post
[[ 7  3  5  8  0  0]
 [ 3  6  9  0  0  0]
 [ 2 10  0  0  0  0]
 [ 2 11 12  0  0  0]
 [ 3 13  4  4  4 14]
 [ 6  0  0  0  0  0]
 [ 5 15  0  0  0  0]
 [ 2  0  0  0  0  0]
 [ 2  0  0  0  0  0]
 [ 2  0  0  0  0  0]]
padded post, maxlen
[[ 7  3  5  8]
 [ 3  6  9  0]
 [ 2 10  0  0]
 [ 2 11 12  0]
 [ 4  4  4 14]
 [ 6  0  0  0]
 [ 5 15  0  0]
 [ 2  0  0  0]
 [ 2  0  0  0]
 [ 2  0  0  0]]
padded post, maxlen post
[[ 7  3  5  8]
 [ 3  6  9  0]
 [ 2 10  0  0]
 [ 2 11 12  0]
 [ 3 13  4  4]
 [ 6  0  0  0]
 [ 5 15  0  0]
 [ 2  0  0  0]
 [ 2  0  0  0]
 [ 2  0  0  0]]

In [14]:
test_data = [
             'adım nerede yazıyor?',
             'benim kedim nerede',
             'sen evde misin acaba benimle gelecek'
]
test_sequence = tokenizer.texts_to_sequences(test_data)

# padding
print('basic padded')
padded = pad_sequences(test_sequence)
print(padded)

print('padded post')
padded_post = pad_sequences(test_sequence, padding='post')
print(padded_post)

print('padded post, maxlen')
padded_post_max = pad_sequences(test_sequence, padding='post', maxlen=4)
print(padded_post_max)

print('padded post, maxlen post')
padded_post_max_post = pad_sequences(test_sequence, padding='post', truncating='post', maxlen=4)
print(padded_post_max_post)

basic padded
[[0 0 0 5 4 1]
 [0 0 0 3 6 4]
 [2 1 1 1 1 1]]
padded post
[[5 4 1 0 0 0]
 [3 6 4 0 0 0]
 [2 1 1 1 1 1]]
padded post, maxlen
[[5 4 1 0]
 [3 6 4 0]
 [1 1 1 1]]
padded post, maxlen post
[[5 4 1 0]
 [3 6 4 0]
 [2 1 1 1]]
