# Natural Language Processing Notes

In [27]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [2]:
print(tf.__version__)
print(keras.__version__)

2.0.0-alpha0
2.2.4-tf


### Tokenizer
The `Tokenizer` handles the heavy lifting for us:
it generates the dictionary of
word encodings and creates vectors out of the sentences.

I pass a parameter, `num_words`, to it.
In this case, I'm using 100, which is way too big,
as there are only six distinct words in this data.
If you're creating a training set based on lots of text,
you usually don't know
how many distinct words there are in that text.


##### Both example data and a real-world data exercise are included

In [3]:
sentences = [
    "I love my cat",
    "I love my dog!",
    "You love my dog"
]
tokenizer = Tokenizer(num_words = 100)

#### By setting this hyperparameter, the tokenizer encodes only the most frequent words (Keras keeps the top `num_words - 1`) and skips the rest.
#### It's a handy shortcut when dealing with lots of data, and worth experimenting with when you train with real data later in this course.
#### Sometimes using fewer words has minimal impact on training accuracy but a huge impact on training time, so use it carefully.
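To make the effect of the cap concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: `encode_with_cap` and `toy_index` are hypothetical names, and the index is written by hand on the assumption that rank 1 is the most frequent word.

```python
def encode_with_cap(text, word_index, num_words=None):
    """Encode a sentence, skipping unknown words and words ranked >= num_words."""
    ids = []
    for word in text.lower().split():
        rank = word_index.get(word)
        if rank is None:
            continue  # word never seen while building the index
        if num_words is not None and rank >= num_words:
            continue  # too rare to survive the cap
        ids.append(rank)
    return ids

# Hypothetical index: 1 is the most frequent word, 5 the rarest.
toy_index = {"the": 1, "cat": 2, "sat": 3, "ran": 4, "dog": 5}

print(encode_with_cap("the dog ran", toy_index))               # [1, 5, 4]
print(encode_with_cap("the dog ran", toy_index, num_words=3))  # [1]
```

With `num_words=3` only ranks 1 and 2 survive, which mirrors the "top `num_words - 1`" behaviour of the real tokenizer.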

In [4]:
tokenizer.fit_on_texts(sentences)

#### The `fit_on_texts` method of the tokenizer takes in the data and builds the word index from it.

In [5]:
word_index = tokenizer.word_index

#### The tokenizer provides a `word_index` property, which returns a dictionary of key-value pairs where the key is the word and the value is the token for that word. You can inspect it by simply printing it out. You can see the results here.

In [6]:
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
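Notice that `I` became `i` and `dog!` became `dog`: the text is lowercased and punctuation is removed before counting. The following is a simplified pure-Python sketch of how such an index could be built, not the Keras code itself. The helper name `fit_word_index` is hypothetical, and stripping everything in `string.punctuation` is an assumption that only approximates the tokenizer's default filter.

```python
from collections import Counter
import string

def fit_word_index(texts):
    """Lowercase, strip punctuation, and number words by descending frequency."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punct).split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ["I love my cat", "I love my dog!", "You love my dog"]
sketch_index = fit_word_index(sentences)
print(sketch_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

The most frequent words get the smallest indices, which is why `love` and `my` (three occurrences each) come first.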


### Text to Sequence

In [7]:
sequence = tokenizer.texts_to_sequences(sentences)

In [8]:
print(sequence)

[[3, 1, 2, 5], [3, 1, 2, 4], [6, 1, 2, 4]]


One really handy thing about this, which you'll use later, is that `texts_to_sequences` can take any set of sentences and encode them based on the word set it learned from the text passed to `fit_on_texts`. This is very significant if you think ahead a little bit. If you train a neural network on a corpus of text, and that text has a word index generated from it, then when you want to do inference with the trained model, you'll have to encode the text you want to infer on with the same word index; otherwise it would be meaningless.
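As a rough illustration of encoding new text with a fixed word index, here is a pure-Python sketch. `texts_to_ids` is a hypothetical helper, not a Keras function, and the index is copied by hand from the output printed earlier. Words missing from the index are simply skipped.

```python
def texts_to_ids(texts, word_index):
    """Encode each sentence with a fixed word index, skipping unknown words."""
    return [
        [word_index[word] for word in text.lower().split() if word in word_index]
        for text in texts
    ]

# Index learned from the three training sentences.
train_index = {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

new_text = ['i really love my dog', 'my dog loves my face']
print(texts_to_ids(new_text, train_index))
# [[3, 1, 2, 4], [2, 4, 2]]
```

The first sentence loses `really` and the second loses `loves` and `face`, so the encoded sequences are shorter than the original sentences.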

In [9]:
test_data = [
    'i really love my dog',
    'my dog loves my face'
]

In [12]:
# tokenizer.fit_on_texts(test_data)
# word_index = tokenizer.word_index 
test_seq = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_seq)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'cat': 5, 'you': 6, 'really': 7, 'loves': 8, 'face': 9}
[[4, 7, 2, 1, 3], [1, 3, 8, 1, 9]]


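Is this an error? No: the word index printed above already contains `really`, `loves`, and `face`, which means the commented-out `fit_on_texts(test_data)` lines had been run in an earlier execution of this cell. With a tokenizer fit only on `sentences`, those unseen words would simply be left out of the sequences.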


In [41]:
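# To handle unseen words, pass an oov_token when creating the tokenizer.
# Note: this new tokenizer is never fit with fit_on_texts here, so every token below comes out as None, and word_index is still the old variable.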
tokenizer = Tokenizer(num_words= 100, oov_token='<OOV>')
test_seq = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_seq)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'cat': 5, 'you': 6, 'really': 7, 'loves': 8, 'face': 9}
[[None, None, None, None, None], [None, None, None, None, None]]
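The `None` values appear because the freshly created tokenizer has no vocabulary yet. Below is a pure-Python sketch of what an out-of-vocabulary token is meant to do once the vocabulary has been built from the training sentences. The helpers `fit_with_oov` and `encode_with_oov` are hypothetical, and reserving index 1 for the OOV token is an assumption modelled on how Keras places it.

```python
from collections import Counter
import string

_STRIP = str.maketrans("", "", string.punctuation)

def _words(text):
    """Lowercase, drop punctuation, split on whitespace."""
    return text.lower().translate(_STRIP).split()

def fit_with_oov(texts, oov_token="<OOV>"):
    """Build a word index that reserves index 1 for the OOV token."""
    counts = Counter(word for text in texts for word in _words(text))
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(vocab, start=1)}

def encode_with_oov(texts, word_index, oov_token="<OOV>"):
    """Replace every unknown word with the OOV index instead of dropping it."""
    oov_id = word_index[oov_token]
    return [[word_index.get(word, oov_id) for word in _words(text)] for text in texts]

train = ["I love my cat", "I love my dog!", "You love my dog"]
unseen = ['i really love my dog', 'my dog loves my face']

oov_index = fit_with_oov(train)
print(oov_index)
print(encode_with_oov(unseen, oov_index))
# [[4, 1, 2, 3, 5], [3, 5, 1, 3, 1]]
```

Every unknown word maps to 1, so the encoded sequences keep the same length as the original sentences.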


### PADDING 


#### Next up is padding. As we mentioned earlier when building neural networks to handle pictures, the images we fed into the network for training needed to be uniform in size.
#### Often, we used generators to resize the images to fit.
#### With text you'll face a similar requirement: before you can train, the sequences need some level of uniformity in size, so padding is your friend there.

In [31]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [33]:
padded = pad_sequences(test_seq, padding='post', maxlen=4)
print(padded)

[[7 2 1 3]
 [3 8 1 9]]


In [36]:
padded = pad_sequences(test_seq, padding='post',
#                        truncating='post',
                       # with truncating left at its default ('pre'), values are dropped from the start of each sequence
                       maxlen=3,  # default is None (pad to the longest sequence)
                      )
print(padded)

[[2 1 3]
 [8 1 9]]


In [37]:
padded = pad_sequences(test_seq, padding='post',
                       truncating='post',
                       # truncating='post' drops values from the end of each sequence
                       maxlen=3,  # default is None (pad to the longest sequence)
                      )
print(padded)

[[4 7 2]
 [1 3 8]]
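The sequences above are all longer than `maxlen`, so only truncation is visible. To show padding and truncation together, here is a small pure-Python sketch of the same logic. `pad` is a hypothetical helper that returns plain lists rather than a NumPy array, and `toy_seqs` is made-up data with one short sequence.

```python
def pad(seqs, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in seqs)  # default: length of the longest sequence
    result = []
    for seq in seqs:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops values from the start, 'post' drops them from the end
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(seq + fill if padding == "post" else fill + seq)
    return result

toy_seqs = [[4, 7, 2, 1, 3], [1, 3]]

print(pad(toy_seqs))
# [[4, 7, 2, 1, 3], [0, 0, 0, 1, 3]]
print(pad(toy_seqs, padding="post", maxlen=3))
# [[2, 1, 3], [1, 3, 0]]
print(pad(toy_seqs, padding="post", truncating="post", maxlen=3))
# [[4, 7, 2], [1, 3, 0]]
```

The first rows of the last two calls match the first rows printed by the two cells above.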


# Sarcasm dataset in Kaggle Exercise


#### Sarcasm in News Headlines Dataset by Rishabh Misra: https://rishabhmisra.github.io/publications/


In [18]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv \
    -O /tmp/bbc-text.csv

--2019-08-19 11:20:38--  https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.163.208, 2404:6800:4007:810::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.163.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5057493 (4.8M) [application/octet-stream]
Saving to: ‘/tmp/bbc-text.csv’


2019-08-19 11:20:40 (5.66 MB/s) - ‘/tmp/bbc-text.csv’ saved [5057493/5057493]



In [20]:
import csv

In [21]:
#Stopwords list from https://github.com/Yoast/YoastSEO.js/blob/develop/src/config/stopwords.js
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]


In [24]:
sentences2 = []
labels = []
with open('/tmp/bbc-text.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        labels.append(row[0])  # first column: category label
        sentence = row[1]      # second column: article text
        for word in stopwords:
            # remove the stopword only where it appears as a space-delimited word
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
            sentence = sentence.replace(" ", " ")
        sentences2.append(sentence)
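An alternative way to drop stopwords is to split the text into words and filter them against a set. This is only a sketch with a hypothetical helper `remove_stopwords` and a shortened stopword list. Unlike the replace-based loop above, it also catches stopwords at the very start or end of the text and collapses repeated spaces, so its output would differ slightly from what the notebook prints.

```python
def remove_stopwords(text, stopwords):
    """Keep only the whitespace-separated words that are not stopwords."""
    stop = set(stopwords)  # set membership checks are fast
    return " ".join(word for word in text.split() if word not in stop)

# Shortened, illustrative stopword list.
small_stopwords = ["the", "in", "of", "with"]

cleaned = remove_stopwords("the tv future in the hands of viewers", small_stopwords)
print(cleaned)
# tv future hands viewers
```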

In [25]:
print(len(sentences2))
print(sentences2[0])

2225
tv future hands viewers home theatre systems  plasma high-definition tvs  digital video recorders moving living room  way people watch tv will radically different five years  time.  according expert panel gathered annual consumer electronics show las vegas discuss new technologies will impact one favourite pastimes. us leading trend  programmes content will delivered viewers via home networks  cable  satellite  telecoms companies  broadband service providers front rooms portable devices.  one talked-about technologies ces digital personal video recorders (dvr pvr). set-top boxes  like us s tivo uk s sky+ system  allow people record  store  play  pause forward wind tv programmes want.  essentially  technology allows much personalised tv. also built-in high-definition tv sets  big business japan us  slower take off europe lack high-definition programming. not can people forward wind adverts  can also forget abiding network channel schedules  putting together a-la-carte entertainment

In [28]:
tokenizer = Tokenizer(oov_token="<OOV>")

In [29]:
tokenizer.fit_on_texts(sentences2)
word_index = tokenizer.word_index
print(len(word_index))

29714


In [34]:
bbc_sequences = tokenizer.texts_to_sequences(sentences2)
bbc_padded = pad_sequences(bbc_sequences, padding="post")
print(bbc_padded[0])
print(bbc_padded.shape)

[  96  176 1158 ...    0    0    0]
(2225, 2441)


In [36]:
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
label_word_index = label_tokenizer.word_index
label_seq = label_tokenizer.texts_to_sequences(labels)

In [37]:
print(label_seq)
print(label_word_index)

[[4], [2], [1], [1], [5], [3], [3], [1], [1], [5], [5], [2], [2], [3], [1], [2], [3], [1], [2], [4], [4], [4], [1], [1], [4], [1], [5], [4], [3], [5], [3], [4], [5], [5], [2], [3], [4], [5], [3], [2], [3], [1], [2], [1], [4], [5], [3], [3], [3], [2], [1], [3], [2], [2], [1], [3], [2], [1], [1], [2], [2], [1], [2], [1], [2], [4], [2], [5], [4], [2], [3], [2], [3], [1], [2], [4], [2], [1], [1], [2], [2], [1], [3], [2], [5], [3], [3], [2], [5], [2], [1], [1], [3], [1], [3], [1], [2], [1], [2], [5], [5], [1], [2], [3], [3], [4], [1], [5], [1], [4], [2], [5], [1], [5], [1], [5], [5], [3], [1], [1], [5], [3], [2], [4], [2], [2], [4], [1], [3], [1], [4], [5], [1], [2], [2], [4], [5], [4], [1], [2], [2], [2], [4], [1], [4], [2], [1], [5], [1], [4], [1], [4], [3], [2], [4], [5], [1], [2], [3], [2], [5], [3], [3], [5], [3], [2], [5], [3], [3], [5], [3], [1], [2], [3], [3], [2], [5], [1], [2], [2], [1], [4], [1], [4], [4], [1], [2], [1], [3], [5], [3], [2], [3], [2], [4], [3], [5], [3], [4], [2],