# Natural language processing in TensorFlow
Text is messier than images: sentences vary in length, and words are not numbers to begin with. How do we represent words and sentences as numbers that a model can learn from?
Week 1: load text, preprocess it, and set up the data so it can be fed to a neural network.
We start by building models that classify the sentiment of text. A model is trained on labeled text and then classifies new text based on what it has seen.
Pixel values were already numbers, but what about text? How can we do sentiment analysis with words?

## Word based encodings
We could encode each character with a character encoding (e.g. ASCII), but would that help us understand the meaning of a word?
```
`LISTEN` and `SILENT` are made of exactly the same characters (and so the same character codes), only in a different order, yet they mean different things!
```
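A quick sketch of the point above, using only Python's built-in `ord`: both words contain the same character codes, so the codes alone carry no hint of meaning.

```python
# Encode each character with its ASCII/Unicode code point.
listen = [ord(c) for c in "LISTEN"]
silent = [ord(c) for c in "SILENT"]

print(listen)  # [76, 73, 83, 84, 69, 78]
print(silent)  # [83, 73, 76, 69, 78, 84]

# The two lists hold the same codes in a different order.
same_codes = sorted(listen) == sorted(silent)
print(same_codes)  # True
```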

What if we assign a number to each word instead?
```
I Love my dog
1 2     3  4

I Love my cat
1 2     3  5

Now the two sentences differ only in the last number, so their similarity is visible.
```
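The hand-made encoding above can be sketched in a few lines of plain Python. The helpers `build_word_index` and `encode` are hypothetical names used only for illustration; they number words in the order they first appear, which is simpler than what a real tokenizer does.

```python
def build_word_index(sentences):
    """Give each new lowercase word the next free integer, starting at 1."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def encode(sentence, word_index):
    """Replace every word of the sentence with its integer."""
    return [word_index[word] for word in sentence.lower().split()]


sentences = ["I Love my dog", "I Love my cat"]
word_index = build_word_index(sentences)
encoded = [encode(s, word_index) for s in sentences]

print(word_index)  # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(encoded)     # [[1, 2, 3, 4], [1, 2, 3, 5]]
```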

## Using APIs
```python
from tensorflow.keras.preprocessing.text import Tokenizer  # one way to encode words and turn sentences into sequences of numbers
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',  # the '!' will be stripped off by the tokenizer
    'Do you think my dog is amazing?'  # a longer sentence to demonstrate padding
]

# num_words: the maximum number of words to keep, ranked by frequency. With num_words=100 only the
# most frequent words (up to num_words - 1) are encoded; word_index still lists every word seen.
# Worth experimenting with: fewer words can cost little accuracy but save a lot of training time. Use with care!
# The Tokenizer strips punctuation and lowercases the sentences.
# oov_token: a special "out of vocabulary" token for unknown words; it must be unique and not appear in the text.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)  # builds the vocabulary (word index) from the sentences
word_index = tokenizer.word_index  # a dictionary of `word -> token`
print(word_index)

# Text to sequence: turn each sentence into a list of numbers based on the tokens.
sequences = tokenizer.texts_to_sequences(sentences)  # one list of tokens per sentence
print(sequences)

padded = pad_sequences(sequences)  # by default, zeros are added at the front up to the longest sentence
# To pad at the end instead and cap the sentence length, pass padding='post' and maxlen.
# Sequences longer than maxlen lose tokens (truncation); truncating='post' cuts from the end instead of the start:
# padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
print(padded)
```


```
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]
```
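The numbering in the printed `word_index` follows word frequency: the OOV token comes first, then the most common word (`my`), and so on. The sketch below reproduces that ordering in plain Python. It is a simplified emulation under stated assumptions, not the Keras implementation: `rank_words` is a hypothetical helper, and it removes punctuation with `string.punctuation`, which is not exactly the filter set the `Tokenizer` uses.

```python
from collections import Counter
import string

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]


def rank_words(texts, oov_token="<OOV>"):
    """Hypothetical helper: index words by descending frequency, OOV token first."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punct).split())
    # most_common() sorts by count; words with equal counts keep first-seen order.
    ranked = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1)}


word_index = rank_words(sentences)
print(word_index)
```

Index 0 is left unused here, which matches the printed output, where 0 only appears as the padding value.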


## Looking more at the Tokenizer
Lessons learned:
- We need a lot of training data to get a broad vocabulary; otherwise new text will contain many unknown words.
- In many cases it is a good idea to map unknown words to a special value with `oov_token` instead of just ignoring them.
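To make the second lesson concrete, here is a minimal pure-Python sketch of what happens to a word that was never seen during fitting. The helper `to_sequence` is hypothetical and only mimics the idea: without an OOV token the unknown word is dropped and the sequence silently gets shorter, while with one the word keeps its position.

```python
# A small vocabulary, as it might look after fitting on a few sentences.
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5}


def to_sequence(sentence, word_index, oov_token=None):
    """Hypothetical helper: map words to ids; unknown words become the OOV id or are skipped."""
    sequence = []
    for word in sentence.lower().split():
        if word in word_index:
            sequence.append(word_index[word])
        elif oov_token is not None:
            sequence.append(word_index[oov_token])
        # Otherwise the unknown word is silently dropped.
    return sequence


test_sentence = "I really love my dog"  # 'really' is not in the vocabulary
dropped = to_sequence(test_sentence, word_index)
kept = to_sequence(test_sentence, word_index, oov_token="<OOV>")

print(dropped)  # [5, 3, 2, 4]
print(kept)     # [5, 1, 3, 2, 4]
```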

## Padding
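Padding makes all sequences the same length by filling the shorter ones with zeros, so that a batch of sentences can be fed to a network as one uniformly shaped array. By default `pad_sequences` adds the zeros at the front and pads to the length of the longest sequence.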
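A rough pure-Python sketch of the idea, assuming list inputs. The helper `pad` is hypothetical and only imitates the `padding`, `truncating` and `maxlen` options of `pad_sequences`; it returns plain lists rather than a NumPy array.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical helper: pad or truncate every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Truncate first: 'pre' keeps the tail, 'post' keeps the head.
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + list(seq) if padding == "pre" else list(seq) + fill)
    return padded


sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
pre = pad(sequences)
post = pad(sequences, maxlen=5, padding="post", truncating="post")

print(pre)   # [[0, 0, 0, 5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
print(post)  # [[5, 3, 2, 4, 0], [8, 6, 9, 2, 4]]
```

With `maxlen=5` the longer sentence loses its last two tokens, which is the data loss that truncation implies.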

## Sarcasm dataset
[High quality dataset for the task of Sarcasm Detection](https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection)

```python
import json

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the dataset; each item holds a headline, a sarcasm label and the article link.
with open("sarcasm.json", "r") as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

# Working with the tokenizer:
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')  # pad with zeros at the end of each headline
print(padded[0])
print(padded.shape)  # (number of headlines, length of the longest headline)
```
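`json.load` expects the whole file to be a single JSON document, such as one array of records. If your copy of the dataset instead stores one JSON object per line (an assumption about the download, not something these notes confirm), a line-by-line loader along these lines may help. The two records below are made up for illustration and stand in for the real file.

```python
import io
import json

# Two made-up records standing in for a file with one JSON object per line.
sample = io.StringIO(
    '{"headline": "example headline one", "is_sarcastic": 0, "article_link": "https://example.com/1"}\n'
    '{"headline": "example headline two", "is_sarcastic": 1, "article_link": "https://example.com/2"}\n'
)

# Parse each non-empty line on its own instead of calling json.load on the whole file.
datastore = [json.loads(line) for line in sample if line.strip()]

sentences = [item["headline"] for item in datastore]
labels = [item["is_sarcastic"] for item in datastore]

print(len(datastore), labels)  # 2 [0, 1]
```

With a real file, `sample` would be replaced by the handle returned from `open(...)`.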




