# Natural language processing in TensorFlow
Text is messier than images: sentences can be short or long, for example. How do we represent words and sentences as numbers that a model can learn from?
Week 1: load text, preprocess it, and set up the data so it can be fed to a neural network.
We start by building models that capture sentiment in text. They will be trained on labeled text and then classify new text based on what they have seen.
Pixel values were already numbers, but what happens with text? How can we do sentiment analysis with words?

## Word based encodings
We could use the character encoding of each letter (e.g. ASCII), but would that help us understand the meaning of a word?
```
`LISTEN` and `SILENT` use exactly the same letters (and so the same character codes), only in a different order, yet they mean different things!
```
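A small sketch of the point above, using Python's built-in `ord` to look up character codes (the variable names are illustrative and not part of the course material). Both words map to the same collection of codes, so the codes alone say nothing about meaning.

```python
# Encode each character of a word as its ASCII code.
listen_codes = [ord(ch) for ch in "LISTEN"]
silent_codes = [ord(ch) for ch in "SILENT"]

print(listen_codes)  # [76, 73, 83, 84, 69, 78]
print(silent_codes)  # [83, 73, 76, 69, 78, 84]

# Ignoring order, the two words are made of exactly the same codes.
same_codes = sorted(listen_codes) == sorted(silent_codes)
print(same_codes)  # True
```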

What if we give a value to each word instead?
```
I Love my dog
1 2    3  4

I Love my cat
1 2    3  5
```
Now we can see the similarity between the sentences: they differ only in the last value.
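The word-level numbering above can be sketched in plain Python. This is only an illustration of the idea, assuming words are split on whitespace and numbered in order of first appearance; `word_to_id` and `encoded` are hypothetical names.

```python
sentences = ["I Love my dog", "I Love my cat"]

word_to_id = {}  # hypothetical mapping: word -> integer id
encoded = []     # one list of ids per sentence

for sentence in sentences:
    ids = []
    for word in sentence.split():
        # Assign the next free id the first time a word is seen.
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id) + 1
        ids.append(word_to_id[word])
    encoded.append(ids)

print(word_to_id)  # {'I': 1, 'Love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(encoded)     # [[1, 2, 3, 4], [1, 2, 3, 5]]
```

The two encoded sentences share their first three ids, which is the kind of similarity a model can pick up on.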

```python
## Using APIs
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer # one of the ways to encode words and transform sentences into vectors

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!' # the '!' will be stripped off by the tokenizer
]

# num_words - the maximum number of words to keep, ranked by frequency (most frequent first).
# The limit applies when texts are later converted to sequences; word_index itself still lists every word seen.
# Worth experimenting with: a smaller vocabulary can sometimes cost little accuracy but save a lot of training time. Use with care!
# By default, Tokenizer strips punctuation and lowercases the text
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences) # builds the vocabulary from the sentences
word_index = tokenizer.word_index # a dictionary of `word->word_token`
print(word_index)
```

```
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```
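To see where these numbers come from, here is a rough pure-Python sketch of the ranking step. It is not the Keras implementation: `build_word_index` is a hypothetical helper, and `FILTERS` is assumed to approximate the Tokenizer's default punctuation filter. Words are lowercased, punctuation is replaced by spaces, and words are then ranked by how often they occur.

```python
from collections import Counter

# Assumption: roughly the punctuation the Tokenizer filters out by default.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts):
    """Hypothetical helper: map each word to its frequency rank (1 = most frequent)."""
    strip_table = str.maketrans({ch: " " for ch in FILTERS})
    counts = Counter()
    for text in texts:
        # Lowercase, replace punctuation with spaces, split on whitespace.
        counts.update(text.lower().translate(strip_table).split())
    # most_common() sorts by count; words with equal counts keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
]

word_index = build_word_index(sentences)
print(word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

`love` and `my` appear three times each, so they get the lowest indices; `cat` and `you` appear once and come last.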
