# Note for Week 1 of NLP-Tensorflow

Course originally supported by [Coursera](https://www.coursera.org/)


## Focus on text
1. We'll start by looking at sentiment in text.
1. We'll learn how to build models that understand text by training them on labeled text.
1. Those models can then classify new text based on what they've seen.

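## Focus on words
- If we encode just the letters, for example with their `ASCII` codes, different words can end up with the same set of values
- e.g. LISTEN and SILENT use the same letters in a different order
- So instead, give a value to each word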
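
The LISTEN / SILENT point can be checked in a few lines of plain Python. This is only an illustrative sketch; the `ascii_codes` helper is a made-up name, not part of any library.

```python
def ascii_codes(word):
    """Hypothetical helper: encode a word as the ASCII codes of its letters."""
    return [ord(ch) for ch in word]

listen = ascii_codes("LISTEN")
silent = ascii_codes("SILENT")

print(listen)  # [76, 73, 83, 84, 69, 78]
print(silent)  # [83, 73, 76, 69, 78, 84]

# Both words contain exactly the same codes, only in a different order,
# so letter values alone say little about which word we are looking at.
print(sorted(listen) == sorted(silent))  # True
```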

## Using APIs

Tokenizer
=====

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

The `Tokenizer` handles the heavy lifting for us: it generates the dictionary of word encodings and creates vectors out of the sentences.

In [2]:
sentences = [
    'I love my dog',
    'I love my cat'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


* `num_words`: caps how many distinct words are encoded, keeping the most frequent ones
* Using fewer words:  
    - Sometimes the impact of using fewer words on training accuracy is minimal,
    - while the impact on training time is huge,
    - but do use it carefully
* `word_index`:  
  The tokenizer provides a `word_index` property, which returns a `dictionary` of key-value pairs where the **key** is the word and the **value** is the token for that word
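
A minimal sketch of how a `num_words` cap plays out, assuming (as in the Keras tokenizer) that `word_index` still lists every word and the cap is only applied when sentences are encoded, keeping indices below `num_words`. The `encode_capped` helper is hypothetical and written in plain Python; it is not the library's code.

```python
# Word index copied from the output above.
word_index = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

def encode_capped(sentence, word_index, num_words=None):
    """Hypothetical helper: keep only known words whose index is below num_words."""
    tokens = []
    for word in sentence.lower().split():
        idx = word_index.get(word)
        if idx is None:
            continue  # unknown word: skipped
        if num_words is not None and idx >= num_words:
            continue  # too rare for the chosen cap: skipped
        tokens.append(idx)
    return tokens

print(encode_capped('I love my dog', word_index))               # [1, 2, 3, 4]
print(encode_capped('I love my dog', word_index, num_words=4))  # [1, 2, 3]
```

With a cap of 100, as in the cells here, nothing is dropped because the vocabulary is far smaller than the cap.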

In [3]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}


### Attention
- `I` was converted to 'i' (lowercased)
- `dog!` was converted to 'dog' (punctuation stripped)
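
To make the lowercasing, punctuation stripping, and frequency ordering concrete, here is a simplified plain-Python stand-in for what fitting does. It is a sketch under assumptions: `build_word_index` is a hypothetical helper, and it strips everything in `string.punctuation`, which is not exactly the Keras default filter list.

```python
from collections import Counter
import string

# Map every punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def build_word_index(sentences):
    """Hypothetical helper: rank words by frequency, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        cleaned = sentence.lower().translate(_PUNCT_TO_SPACE)
        counts.update(cleaned.split())
    # most_common() keeps first-seen order among words with equal counts.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ['I love my dog', 'I love my cat', 'You love my dog!']
print(build_word_index(sentences))
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

The result matches the tokenizer output above: 'love' and 'my' appear three times, so they move ahead of 'i'.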

One really handy thing that you'll use later:  
**`texts_to_sequences` can take any set of sentences** and encode them based on the word set learned from the sentences that were passed to `fit_on_texts`.

**Note that** training and test sentences must be encoded with the same word index, otherwise the tokens mean different things.
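
Why the shared index matters can be seen with a small plain-Python sketch. The `rank_words` helper is hypothetical (a simplified frequency ranking, not the Keras implementation): building a second index from different sentences gives the same word a different number.

```python
from collections import Counter

def rank_words(sentences):
    """Hypothetical helper: word -> rank by frequency (1 = most frequent)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

train_index = rank_words(['I love my dog', 'I love my cat'])
other_index = rank_words(['my dog loves my manatee'])

print(train_index['dog'])  # 4
print(other_index['dog'])  # 2
```

A model trained on sequences where 4 means 'dog' cannot make sense of input where 'dog' is 2, which is why new text is encoded with the tokenizer that was already fitted.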

In [6]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
    
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
print(word_index)
print(sequences)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


In [9]:
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]
test_sequences = tokenizer.texts_to_sequences(test_data)
print(test_sequences)

[[4, 2, 1, 3], [1, 3, 1]]


### Out of vocabulary
- First of all, we really need a lot of training data to cover a broad vocabulary
- Second, instead of ignoring unknown words, assign them a special value
    - *Let's look at `<OOV>`*,    
    which means: Out Of Vocabulary
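
The effect of an OOV token can be sketched in plain Python. The `encode` helper below is hypothetical and only mimics the idea: unknown words are dropped when no OOV token is set, and mapped to the OOV index when one is.

```python
# Example index with '<OOV>' reserved as token 1.
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def encode(sentence, word_index, oov_token=None):
    """Hypothetical helper: look up each word, falling back to the OOV token."""
    tokens = []
    for word in sentence.lower().split():
        if word in word_index:
            tokens.append(word_index[word])
        elif oov_token is not None:
            tokens.append(word_index[oov_token])
        # with no OOV token, unknown words are silently dropped
    return tokens

print(encode('my dog loves my manatee', word_index))
# [2, 4, 2]
print(encode('my dog loves my manatee', word_index, oov_token='<OOV>'))
# [2, 4, 1, 2, 1]
```

With the OOV token the encoded sentence keeps its original length, so the position of each known word is preserved.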

In [12]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
    
]

tokenizer = Tokenizer(num_words=100 , oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]
test_sequences = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_sequences)


{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]


### Padding
Sentences need a uniform shape before they can be fed to a model, just like the uniform input sizes we needed in the convolutional part of this course.

Once the tokenizer has created the sequences, they can be passed to `pad_sequences` to be padded like this.

Each sentence will then have the same length.
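
The idea behind padding can be sketched without any library. The `pad_pre` helper is hypothetical: it pads every sequence with zeros at the front up to the length of the longest one, which is how the default behaviour described here looks in the output.

```python
def pad_pre(sequences, value=0):
    """Hypothetical helper: left-pad each sequence to the longest length."""
    maxlen = max(len(seq) for seq in sequences)
    return [[value] * (maxlen - len(seq)) + list(seq) for seq in sequences]

sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
for row in pad_pre(sequences):
    print(row)
# [0, 0, 0, 5, 3, 2, 4]
# [8, 6, 9, 2, 4, 10, 11]
```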

In [13]:
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [15]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
    
]

tokenizer = Tokenizer(num_words=100 , oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences)

print(word_index)
print(sequences)
print(padded)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


In [18]:

# By default, sequences longer than maxlen are truncated from the front ('pre');
# truncating='post' drops tokens from the end instead.
padded = pad_sequences(sequences, padding='post', maxlen=5, truncating='post')

print(word_index)
print(sequences)
print(padded)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]
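
The difference between 'pre' and 'post' for both padding and truncating can be sketched in plain Python. The `fit_length` helper is hypothetical and only mirrors the behaviour seen in these outputs; it is not the library's implementation.

```python
def fit_length(seq, maxlen, padding='pre', truncating='pre', value=0):
    """Hypothetical helper: truncate or pad one sequence to exactly maxlen."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the front, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return pad + list(seq) if padding == 'pre' else list(seq) + pad

long_seq = [8, 6, 9, 2, 4, 10, 11]
print(fit_length(long_seq, 5))                        # [9, 2, 4, 10, 11]
print(fit_length(long_seq, 5, truncating='post'))     # [8, 6, 9, 2, 4]
print(fit_length([5, 3, 2, 4], 5))                    # [0, 5, 3, 2, 4]
print(fit_length([5, 3, 2, 4], 5, padding='post'))    # [5, 3, 2, 4, 0]
```

Which side to cut is a judgment call: truncating 'pre' keeps the end of a sentence, truncating 'post' keeps its beginning.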


In [19]:
padded = pad_sequences(sequences, maxlen=5)
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded: \n" ,padded)

test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]
test_sequences = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_sequences)

test_pad = pad_sequences(test_sequences, maxlen=10)
print("\nPadded Test Sequence: \n", test_pad)


Word Index =  {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

Sequences =  [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

Padded: 
 [[ 0  5  3  2  4]
 [ 0  5  3  2  7]
 [ 0  6  3  2  4]
 [ 9  2  4 10 11]]

Test Sequence =  [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

Padded Test Sequence: 
 [[0 0 0 0 0 5 1 3 2 4]
 [0 0 0 0 0 2 4 1 2 1]]
