# **NLP Introduction**
- This notebook introduces Natural Language Processing (NLP).
- We will look at concepts that help us convert text into data that can be used in ML and DL models.

## **Libraries**
- We will make use of TensorFlow in this notebook, specifically the Keras text and sequence preprocessing utilities.

In [1]:
import tensorflow as tf

In [3]:
Tokenizer = tf.keras.preprocessing.text.Tokenizer
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences

## **Data**
- For demonstration, we create some sentences to use as data.
- *Sentences created from some sayings and songs that sprang to mind* 😏

In [4]:
sentences = [
             'My very earnest mother',
             'Just sent us',
             'No porridge please',
             'Yummy yummy make my tummy bummy',
             'Up up away it goes',
             'Mcdonald had a farm',
             'The farm had some goats',
             'The goats had some kids',
             'A meeh here a meeh there',
             'Everywhere meeh meeh',
             'Beast of England, beast of Ireland',
             'Beast of every land and clime',
             'Hence forth to my joyful tiding',
             'Of the future golden times',
             'Soon or late day shall come',
             'When man shall be over throne'
]
print(sentences[:5])

['My very earnest mother', 'Just sent us', 'No porridge please', 'Yummy yummy make my tummy bummy', 'Up up away it goes']


## **Tokenizer**
- Now we will create our word tokenizer.
- We need to specify the maximum number of words in our dictionary; in this case we will use 100.
- With this setting, only the most frequent words are kept when text is converted to sequences. Our sentences contain fewer than 100 distinct words, so none are dropped here.
- We will also use an out-of-vocabulary (OOV) token for cases where a sentence contains words that are not in our dictionary.


In [5]:
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')

__Comments__
- Now we will apply this to our sentences.
- From this we can generate the word index, which assigns a number to each word.

In [6]:
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'meeh': 2, 'of': 3, 'my': 4, 'had': 5, 'a': 6, 'the': 7, 'beast': 8, 'yummy': 9, 'up': 10, 'farm': 11, 'some': 12, 'goats': 13, 'shall': 14, 'very': 15, 'earnest': 16, 'mother': 17, 'just': 18, 'sent': 19, 'us': 20, 'no': 21, 'porridge': 22, 'please': 23, 'make': 24, 'tummy': 25, 'bummy': 26, 'away': 27, 'it': 28, 'goes': 29, 'mcdonald': 30, 'kids': 31, 'here': 32, 'there': 33, 'everywhere': 34, 'england': 35, 'ireland': 36, 'every': 37, 'land': 38, 'and': 39, 'clime': 40, 'hence': 41, 'forth': 42, 'to': 43, 'joyful': 44, 'tiding': 45, 'future': 46, 'golden': 47, 'times': 48, 'soon': 49, 'or': 50, 'late': 51, 'day': 52, 'come': 53, 'when': 54, 'man': 55, 'be': 56, 'over': 57, 'throne': 58}
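
The ordering in the word index above follows word frequency: index 1 is reserved for the OOV token, and the remaining words are numbered from most to least frequent, with ties kept in first-seen order. The block below is a minimal pure-Python sketch of that idea, not the Keras implementation. The helper name `build_word_index` and its simplified text cleaning (lowercasing and keeping alphabetic runs only) are assumptions made for illustration.

```python
from collections import Counter
import re


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphabetic runs only. This is a simplification
        # of the punctuation filtering a real tokenizer applies.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    word_index = {oov_token: 1}
    # most_common() sorts by count; ties keep their first-seen order.
    for rank, (word, _count) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


sample = ["Mcdonald had a farm", "The farm had some goats"]
sample_index = build_word_index(sample)
print(sample_index)
```

In this small sample, `had` and `farm` each occur twice, so they receive the lowest indices after the OOV token.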


## **Sentences to Sequences**
- Now we will convert our sentences to sequences of numbers using the word index we have just created.

In [7]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences[:5])

[[4, 15, 16, 17], [18, 19, 20], [21, 22, 23], [9, 9, 24, 4, 25, 26], [10, 10, 27, 28, 29]]
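
A quick way to check that these sequences really encode the sentences is to invert the word index and map the numbers back to words. The sketch below does this with a plain dictionary. `word_index_excerpt` is a hand-copied excerpt of the word index printed earlier, and `decode` is a hypothetical helper used only for illustration.

```python
# Small excerpt of the word index printed earlier (not the full dictionary).
word_index_excerpt = {"my": 4, "very": 15, "earnest": 16, "mother": 17,
                      "just": 18, "sent": 19, "us": 20}

# Invert the mapping: integer id -> word.
index_word = {idx: word for word, idx in word_index_excerpt.items()}


def decode(sequence, index_word, unknown="?"):
    """Turn a list of integer ids back into a space-separated string."""
    return " ".join(index_word.get(idx, unknown) for idx in sequence)


decoded = [decode(seq, index_word) for seq in [[4, 15, 16, 17], [18, 19, 20]]]
print(decoded)
```

The decoded text comes back lowercased because the word index stores words in lowercase.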


## **Padding Sequences**
- For our modeling process, we need to make sure that the inputs are all of the same length.
- The sentences we have created are of different lengths, resulting in sequences of different sizes.
- To combat this we will make use of padding: we add zeros to short sequences, either at the beginning or at the end.
- We can also set the maximum number of inputs we want in a sequence, which cuts sequences longer than the specified maximum.

In [8]:
padded_sequences = pad_sequences(sequences)
print('Original sequences = ', sequences)
print('Padded sequences = ', padded_sequences)

Original sequences =  [[4, 15, 16, 17], [18, 19, 20], [21, 22, 23], [9, 9, 24, 4, 25, 26], [10, 10, 27, 28, 29], [30, 5, 6, 11], [7, 11, 5, 12, 13], [7, 13, 5, 12, 31], [6, 2, 32, 6, 2, 33], [34, 2, 2], [8, 3, 35, 8, 3, 36], [8, 3, 37, 38, 39, 40], [41, 42, 43, 4, 44, 45], [3, 7, 46, 47, 48], [49, 50, 51, 52, 14, 53], [54, 55, 14, 56, 57, 58]]
Padded sequences =  [[ 0  0  4 15 16 17]
 [ 0  0  0 18 19 20]
 [ 0  0  0 21 22 23]
 [ 9  9 24  4 25 26]
 [ 0 10 10 27 28 29]
 [ 0  0 30  5  6 11]
 [ 0  7 11  5 12 13]
 [ 0  7 13  5 12 31]
 [ 6  2 32  6  2 33]
 [ 0  0  0 34  2  2]
 [ 8  3 35  8  3 36]
 [ 8  3 37 38 39 40]
 [41 42 43  4 44 45]
 [ 0  3  7 46 47 48]
 [49 50 51 52 14 53]
 [54 55 14 56 57 58]]
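
The padded array above is 6 columns wide because, when no maximum length is given, every sequence is padded to the length of the longest one. A short sketch of that check, using the first five sequences from the output above (the variable names are illustrative):

```python
# First five sequences copied from the output above.
first_five = [[4, 15, 16, 17], [18, 19, 20], [21, 22, 23],
              [9, 9, 24, 4, 25, 26], [10, 10, 27, 28, 29]]

# Length of each sequence, and the longest one.
lengths = [len(seq) for seq in first_five]
longest = max(lengths)
print(lengths, longest)
```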


__Comments__
- We will specify the maximum length we want in our sequences.
- Sequences longer than this are truncated from the beginning by default.

In [9]:
padded_sequences1 = pad_sequences(sequences, maxlen=5)
print('Original sequences = ', sequences)
print('Padded sequences = ', padded_sequences1)

Original sequences =  [[4, 15, 16, 17], [18, 19, 20], [21, 22, 23], [9, 9, 24, 4, 25, 26], [10, 10, 27, 28, 29], [30, 5, 6, 11], [7, 11, 5, 12, 13], [7, 13, 5, 12, 31], [6, 2, 32, 6, 2, 33], [34, 2, 2], [8, 3, 35, 8, 3, 36], [8, 3, 37, 38, 39, 40], [41, 42, 43, 4, 44, 45], [3, 7, 46, 47, 48], [49, 50, 51, 52, 14, 53], [54, 55, 14, 56, 57, 58]]
Padded sequences =  [[ 0  4 15 16 17]
 [ 0  0 18 19 20]
 [ 0  0 21 22 23]
 [ 9 24  4 25 26]
 [10 10 27 28 29]
 [ 0 30  5  6 11]
 [ 7 11  5 12 13]
 [ 7 13  5 12 31]
 [ 2 32  6  2 33]
 [ 0  0 34  2  2]
 [ 3 35  8  3 36]
 [ 3 37 38 39 40]
 [42 43  4 44 45]
 [ 3  7 46 47 48]
 [50 51 52 14 53]
 [55 14 56 57 58]]


__Comments__
- Instead of the default padding at the beginning of the sequence, we will now apply padding at the end of the sequence.
- We will still maintain our maximum length of 5.

In [10]:
padded_sequences2 = pad_sequences(sequences, maxlen=5, padding='post')
print('Original sequences = ', sequences)
print('Padded sequences = ', padded_sequences2)

Original sequences =  [[4, 15, 16, 17], [18, 19, 20], [21, 22, 23], [9, 9, 24, 4, 25, 26], [10, 10, 27, 28, 29], [30, 5, 6, 11], [7, 11, 5, 12, 13], [7, 13, 5, 12, 31], [6, 2, 32, 6, 2, 33], [34, 2, 2], [8, 3, 35, 8, 3, 36], [8, 3, 37, 38, 39, 40], [41, 42, 43, 4, 44, 45], [3, 7, 46, 47, 48], [49, 50, 51, 52, 14, 53], [54, 55, 14, 56, 57, 58]]
Padded sequences =  [[ 4 15 16 17  0]
 [18 19 20  0  0]
 [21 22 23  0  0]
 [ 9 24  4 25 26]
 [10 10 27 28 29]
 [30  5  6 11  0]
 [ 7 11  5 12 13]
 [ 7 13  5 12 31]
 [ 2 32  6  2 33]
 [34  2  2  0  0]
 [ 3 35  8  3 36]
 [ 3 37 38 39 40]
 [42 43  4 44 45]
 [ 3  7 46 47 48]
 [50 51 52 14 53]
 [55 14 56 57 58]]
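
Note that in the output above the longer sequences still lose their first element even though padding is now applied at the end: padding and truncation are controlled separately. The following is a simplified pure-Python sketch of that behaviour, not the Keras implementation. The function name `pad` is hypothetical, and its `padding` and `truncating` arguments are modelled on the options of `pad_sequences`.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate each list of ids to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Truncate first: 'pre' drops ids from the front, 'post' from the end.
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Then pad: 'pre' puts the fill before the ids, 'post' after them.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


demo = [[18, 19, 20], [9, 9, 24, 4, 25, 26]]
pre_padded = pad(demo, maxlen=5)
post_padded = pad(demo, maxlen=5, padding="post")
post_both = pad(demo, maxlen=5, padding="post", truncating="post")
print(pre_padded)
print(post_padded)
print(post_both)
```

Only the last call keeps the leading `9, 9` of the longer sequence, because it truncates from the end instead of the beginning.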


__Comments__
- Now we will change the maximum length of the sequence to 15, which is longer than any sentence we have.

In [11]:
padded_sequences3 = pad_sequences(sequences, maxlen=15, padding='post')
print('Original sequences = ', sequences)
print('Padded sequences = ', padded_sequences3)

Original sequences =  [[4, 15, 16, 17], [18, 19, 20], [21, 22, 23], [9, 9, 24, 4, 25, 26], [10, 10, 27, 28, 29], [30, 5, 6, 11], [7, 11, 5, 12, 13], [7, 13, 5, 12, 31], [6, 2, 32, 6, 2, 33], [34, 2, 2], [8, 3, 35, 8, 3, 36], [8, 3, 37, 38, 39, 40], [41, 42, 43, 4, 44, 45], [3, 7, 46, 47, 48], [49, 50, 51, 52, 14, 53], [54, 55, 14, 56, 57, 58]]
Padded sequences =  [[ 4 15 16 17  0  0  0  0  0  0  0  0  0  0  0]
 [18 19 20  0  0  0  0  0  0  0  0  0  0  0  0]
 [21 22 23  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 9  9 24  4 25 26  0  0  0  0  0  0  0  0  0]
 [10 10 27 28 29  0  0  0  0  0  0  0  0  0  0]
 [30  5  6 11  0  0  0  0  0  0  0  0  0  0  0]
 [ 7 11  5 12 13  0  0  0  0  0  0  0  0  0  0]
 [ 7 13  5 12 31  0  0  0  0  0  0  0  0  0  0]
 [ 6  2 32  6  2 33  0  0  0  0  0  0  0  0  0]
 [34  2  2  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 8  3 35  8  3 36  0  0  0  0  0  0  0  0  0]
 [ 8  3 37 38 39 40  0  0  0  0  0  0  0  0  0]
 [41 42 43  4 44 45  0  0  0  0  0  0  0  0  0]
 [ 3  7 46

## **OOV Application**
- Now we will demonstrate what happens when a sentence contains words outside our dictionary.
- We will generate some new sentences (*inspired by the empty yoghurt drink on my desk, hence we will call it yog* 😆).

In [12]:
yog_sentences = [
                'Chocolate milk i hate both',
                'I still take hot coco',
                'Forest berry yummy yummy',
                'Caramel is the best with lemons',
                'The low fat berry got me pumped'
]
print(yog_sentences)

['Chocolate milk i hate both', 'I still take hot coco', 'Forest berry yummy yummy', 'Caramel is the best with lemons', 'The low fat berry got me pumped']


__Comments__
- Now we will generate sequences for our new sentences.

In [13]:
yog_sequence = tokenizer.texts_to_sequences(yog_sentences)
print('yog Sequence = ', yog_sequence)

yog Sequence =  [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 9, 9], [1, 1, 7, 1, 1, 1], [7, 1, 1, 1, 1, 1, 1]]
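
The output above can be reproduced with a plain dictionary lookup that falls back to the OOV index. The block below is a rough sketch under stated assumptions: `encode` is a hypothetical helper, `known` is a hand-copied excerpt of the word index built earlier, and splitting on whitespace stands in for real tokenization. The optional `num_words` argument mimics the vocabulary cap, where words whose index is at or above the cap are also treated as OOV.

```python
def encode(text, word_index, num_words=None, oov_index=1):
    """Map each word to its id; unknown or out-of-range words become OOV."""
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word)
        # Unknown word, or a word ranked beyond the num_words cut-off.
        if idx is None or (num_words is not None and idx >= num_words):
            idx = oov_index
        ids.append(idx)
    return ids


# Excerpt of the word index built earlier (only the entries needed here).
known = {"<OOV>": 1, "the": 7, "yummy": 9}

berry = encode("Forest berry yummy yummy", known)
caramel = encode("Caramel is the best with lemons", known)
berry_capped = encode("Forest berry yummy yummy", known, num_words=9)
print(berry)
print(caramel)
print(berry_capped)
```

With `num_words=9`, even `yummy` (index 9) falls outside the cap and is replaced by the OOV index.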


__Comments__
- Every 1 represents a word that is not in the previously generated word index.
- Finally, we demonstrate applying a maximum length and padding.

In [14]:
padded_yog_sequence = pad_sequences(yog_sequence, maxlen=15, padding='post')
print('Original yog sequences = ', yog_sequence)
print('Padded yog sequences = ', padded_yog_sequence)

Original yog sequences =  [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 9, 9], [1, 1, 7, 1, 1, 1], [7, 1, 1, 1, 1, 1, 1]]
Padded yog sequences =  [[1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]
 [1 1 9 9 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 7 1 1 1 0 0 0 0 0 0 0 0 0]
 [7 1 1 1 1 1 1 0 0 0 0 0 0 0 0]]
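
One way to quantify how much of the new text is lost is to count the share of tokens mapped to the OOV index. The sketch below uses the yog sequences printed above; the variable names are illustrative only.

```python
# Unpadded yog sequences copied from the output above.
yog_ids = [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 9, 9],
           [1, 1, 7, 1, 1, 1], [7, 1, 1, 1, 1, 1, 1]]

OOV_INDEX = 1  # index reserved for the '<OOV>' token

total_tokens = sum(len(seq) for seq in yog_ids)
oov_tokens = sum(seq.count(OOV_INDEX) for seq in yog_ids)
oov_rate = oov_tokens / total_tokens
print(f"{oov_tokens} of {total_tokens} tokens are OOV ({oov_rate:.0%})")
```

A high rate like this one suggests the tokenizer was fitted on text that shares very little vocabulary with the new sentences.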


## **Conclusion**
- Now that we have a basic understanding of how to convert text into data for ML and DL models, next we will apply this to sentiment analysis.