<a href="https://colab.research.google.com/github/koushikpr/Machine-Learning-Prerequisites/blob/Natural-Language-Processing/Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Natural Language Processing**

---



Natural Language Processing (NLP) is the field that deals with the interaction between computers and human language.

**Recurrent Neural Network:** This type of neural network is well suited to processing sequential data such as text or characters. We will cover:
1. Sentiment Analysis
2. Character Generation




## Sequence Data
Until now, we have been dealing with data in which order does not matter. With data such as video, text, audio, or weather patterns, the order of the elements in time carries meaning and has to be taken into account.

Text cannot be processed in random order: the words and characters must stay in their original order for the results to be accurate. Neural networks and other machine learning models also cannot process raw text, so the text has to be converted to numerical values first.

## Encoding Text Sequence

As a first step, we create a bag of words, which assigns a numerical value to every word in a sentence and keeps track of the frequency (number of repetitions) of every word.

In [1]:
vocab = {}
word_code = 1
def bow(text):
  global word_code
  words = text.lower().split(' ')#splits the text to words of lower case
  bag={}
  for word in words:
    if word in vocab:
      encoding  = vocab[word]#extract the encoded data from the dictionary
    else:
      vocab[word]=word_code#assigning a new encoding data to new words
      encoding = word_code
      word_code +=1#incrementing label to assign to new words
    if encoding in bag:
      bag[encoding]+=1#counting the label's frequency
    else:
      bag[encoding]=1#assigning frequency to new words

  return bag

text = "this is a test to see if this test will work is is test a a"
print(bow(text))
print(vocab)


{1: 2, 2: 3, 3: 3, 4: 3, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}
{'this': 1, 'is': 2, 'a': 3, 'test': 4, 'to': 5, 'see': 6, 'if': 7, 'will': 8, 'work': 9}


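Because a bag of words only stores counts, it cannot tell apart sentences that use the same words in a different order. Integer encoding, which keeps the order of the words, addresses this.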
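
The order-blindness of a bag of words can be checked directly. The sketch below is not part of the original notebook: it uses `collections.Counter` as a stand-in for the `bow` function, and the helper name `bag_of_words` is made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase word occurs; word order is discarded."""
    return Counter(text.lower().split())

positive = "I thought the movie was going to be bad but it was actually amazing"
negative = "I thought the movie was going to be amazing but it was actually bad"

# Both reviews contain exactly the same words, so their bags are equal
same_bag = bag_of_words(positive) == bag_of_words(negative)
print(same_bag)  # True
```

Two reviews with opposite meanings end up with the same representation, which is why the word order has to be kept.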

In [2]:
vocab = {}
wc = 1
def ic(text):
  global wc
  words = text.lower().split(' ')
  encoding = []

  for word in words:
    if word in vocab:
      code = vocab[word]
      encoding.append(code)
    else:
      vocab[word]=wc
      encoding.append(wc)
      wc+=1
    
  return encoding

text = "this is a test to see if this test will work is is test a a"
encodin = ic(text)
print(encodin)
print(vocab)




[1, 2, 3, 4, 5, 6, 7, 1, 4, 8, 9, 2, 2, 4, 3, 3]
{'this': 1, 'is': 2, 'a': 3, 'test': 4, 'to': 5, 'see': 6, 'if': 7, 'will': 8, 'work': 9}


In this case, the position of every word is recorded, and the frequencies can still be recovered by counting.

In [3]:
positive_review = "I thought the movie was going to be bad but it was actually amazing"
negative_review = "I thought the movie was going to be amazing but it was actually bad"

pos_encode = ic(positive_review)
neg_encode = ic(negative_review)

print("Positive:", pos_encode)
print("Negative:", neg_encode)

Positive: [10, 11, 12, 13, 14, 15, 5, 16, 17, 18, 19, 14, 20, 21]
Negative: [10, 11, 12, 13, 14, 15, 5, 16, 21, 18, 19, 14, 20, 17]


Because the words sit at different positions, the two reviews now get different encodings and can be told apart. However, this method cannot link synonyms: words with similar meanings receive unrelated integers. To address this we use word embeddings.
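
To make the idea of an embedding more concrete, here is a small sketch in plain Python. The 2-D vectors are hand-picked for illustration only (real embeddings are learned during training and have many more dimensions), and `toy_embedding` and `cosine_similarity` are hypothetical names that do not appear elsewhere in the notebook.

```python
import math

# Hand-picked 2-D vectors, purely illustrative (real embeddings are learned)
toy_embedding = {
    "good":  [0.9, 0.2],
    "great": [0.8, 0.3],
    "bad":   [-0.9, 0.1],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sim_synonyms = cosine_similarity(toy_embedding["good"], toy_embedding["great"])
sim_opposites = cosine_similarity(toy_embedding["good"], toy_embedding["bad"])
print(sim_synonyms > sim_opposites)  # True
```

Unlike integer codes, vectors can be compared: words with similar meanings can sit close together, while unrelated or opposite words can sit far apart.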


## Recurrent Neural Network (RNN)

1. **Simple RNN:** Until now we have been using feed-forward networks. An RNN processes one word at a time while maintaining an internal memory of what it has already seen. This allows it to treat words differently based on their order in a sentence and to build an understanding of the entire input gradually, one word at a time.

2. **Long Short-Term Memory (LSTM):** In a simple RNN, information from earlier time steps gradually fades as the sequence gets longer. An LSTM adds a mechanism that lets the network retain information from much earlier points in the sequence.

3. **Sentiment Analysis:** The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.
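
The "internal memory" of the simple RNN described in item 1 can be sketched with a single recurrent unit in plain Python. This is a toy illustration, not the Keras implementation: the weights `w_x`, `w_h` and `b` are fixed, made-up values, whereas a real network learns them and works with vectors rather than single numbers.

```python
import math

def simple_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit RNN over a sequence and return the final hidden state."""
    h = 0.0  # internal memory, empty before the first input
    for x in inputs:
        # the new memory depends on the current input AND the previous memory
        h = math.tanh(w_x * x + w_h * h + b)
    return h

forward = simple_rnn([1.0, 2.0, 3.0])
backward = simple_rnn([3.0, 2.0, 1.0])
print(forward != backward)  # True
```

The same three inputs in a different order give a different final state, which is the property that lets an RNN distinguish sentences that differ only in word order.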

### Movie Review System

Step 1: Setup

In [4]:
import tensorflow as tf
from keras.datasets import imdb
from keras.preprocessing import sequence
import os
import numpy as np
%tensorflow_version 2.x


Step 2: Load Dataset

In [11]:
(train,trainlabel),(test,testlabel) = imdb.load_data(num_words=88584)

In [12]:
x = train[1]
print(len(test))

25000


Step 3: Padding or trimming every review to exactly 250 words

In [13]:
train = sequence.pad_sequences(train,250)
test = sequence.pad_sequences(test,250)
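
As a rough picture of what `pad_sequences` does with its default settings (zeros are added at the front, and over-long sequences lose their earliest tokens), here is a simplified pure-Python sketch. `pad_or_truncate` is a hypothetical helper written for this explanation, not a Keras function, and it handles a single sequence rather than a batch.

```python
def pad_or_truncate(tokens, maxlen, pad_value=0):
    """Hypothetical helper mimicking pad_sequences' default 'pre' behaviour."""
    if len(tokens) >= maxlen:
        return tokens[-maxlen:]  # keep only the LAST maxlen tokens
    # too short: fill the front with pad_value
    return [pad_value] * (maxlen - len(tokens)) + tokens

print(pad_or_truncate([1, 2, 3], 5))           # [0, 0, 1, 2, 3]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

After this step every review has the same length, which lets the reviews be stacked into one rectangular array for the model.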

Step 4: Creating the Model

In [16]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(88584,32),#embedding layer: vocabulary size 88584, 32-dimensional word vectors
    tf.keras.layers.LSTM(32),#LSTM layer with 32 units
    tf.keras.layers.Dense(1,activation='sigmoid')
])
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 32)          2834688   
                                                                 
 lstm_1 (LSTM)               (None, 32)                8320      
                                                                 
 dense_1 (Dense)             (None, 1)                 33        
                                                                 
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
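
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the three helper functions are written for this explanation and are not part of Keras.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per word in the vocabulary
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # one weight per input per unit, plus one bias per unit
    return input_dim * units + units

total = embedding_params(88584, 32) + lstm_params(32, 32) + dense_params(32, 1)
print(total)  # 2843041
```

Almost all of the parameters sit in the embedding layer, since it stores a 32-dimensional vector for each of the 88,584 words.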


Step 5: Training the Model

In [24]:
model.compile(loss="binary_crossentropy",optimizer="rmsprop",metrics=['acc'])

history = model.fit(train, trainlabel, epochs=1, validation_split=0.2)




Step 6: Evaluating Results

In [25]:
res = model.evaluate(test,testlabel)
print(res)

[0.29951250553131104, 0.8787199854850769]


Making Predictions

Step 1: Encoding Data

In [34]:
word_index = imdb.get_word_index()#returns the encoded data of every word

def encodingtxt(text):
  tokens = tf.keras.preprocessing.text.text_to_word_sequence(text)
  tokens = [word_index[word] if word in word_index else 0 for word in tokens]
  return sequence.pad_sequences([tokens],250)

text = "that movie was just amazing, so amazing"
encoded = encodingtxt(text)[0]
print(encoded)
  

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  12  17  13  4

Step 2: Decoding Data

In [36]:
reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode(inte):
  p=0
  text = ""
  for n in inte:
    if n != p:
      text+=reverse_word_index[n] + " "
    
  return text

print(decode(encoded))

that movie was just amazing so amazing 


Step 3: Prediction

In [48]:
def predict(text):
  et = encodingtxt(text)
  pred = np.zeros((1,250))
  pred[0] = et
  result = model.predict(pred)
  print(result[0])

positive_review = "That movie was! really loved it and would great watch it again because it was amazingly great"
predict(positive_review)

negative_review = "that movie really sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
predict(negative_review)

[0.8745984]
[0.5387743]


We can infer that the higher the value, the more positive the model considers the review.
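
If a hard label is needed instead of a score, a common convention is to cut the sigmoid output at 0.5. That threshold is an assumption made for this sketch, not something the notebook specifies, and `label_from_score` is a hypothetical helper.

```python
def label_from_score(score, threshold=0.5):
    """Map a sigmoid output in (0, 1) to a label; the 0.5 cutoff is an assumption."""
    return "positive" if score >= threshold else "negative"

print(label_from_score(0.8745984))  # positive
print(label_from_score(0.5387743))  # positive
```

With this cutoff the second score, which came from the negative review, would still be labelled positive, although only barely. The model is clearly much less certain about that review than about the first one.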

## RNN Play Generator

Now for one of the coolest examples so far: we are going to use an RNN to generate a play. We will simply show the RNN an example of something we want it to recreate, and it will learn how to write a version of it on its own. We'll do this using a character-level predictive model that takes a variable-length sequence as input and predicts the next character. We can call the model many times in a row, feeding the output of the last prediction in as the input for the next call, to generate a sequence.


*This guide is based on the following: https://www.tensorflow.org/tutorials/text/text_generation*

In [None]:
%tensorflow_version 2.x  # this line is not required unless you are in a notebook
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

`%tensorflow_version` only switches the major version: 1.x or 2.x.
You set: `2.x  # this line is not required unless you are in a notebook`. This will be interpreted as: `2.x`.


TensorFlow 2.x selected.


### Dataset
For this example, we only need one piece of training data. In fact, we could write our own poem or play and pass that to the network for training if we'd like. However, to make things easy we'll use an extract from a Shakespeare play.




In [None]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


### Loading Your Own Data
To load your own data, you'll need to upload a file from the dialog below. Then you'll need to follow the steps from above but load in this new file instead.



In [None]:
from google.colab import files
path_to_file = list(files.upload().keys())[0]


### Read Contents of File
Let's look at the contents of the file.

In [None]:
# Read the file as bytes, then decode it as UTF-8.
with open(path_to_file, 'rb') as f:
  text = f.read().decode(encoding='utf-8')
# length of text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


In [None]:
# Take a look at the first 250 characters in text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



### Encoding
Since this text isn't encoded yet, we'll need to do that ourselves. We are going to encode each unique character as a different integer.



In [None]:
vocab = sorted(set(text))
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

def text_to_int(text):
  return np.array([char2idx[c] for c in text])

text_as_int = text_to_int(text)

In [None]:
# let's look at how part of our text is encoded
print("Text:", text[:13])
print("Encoded:", text_to_int(text[:13]))

Text: First Citizen
Encoded: [18 47 56 57 58  1 15 47 58 47 64 43 52]


And here we will make a function that can convert our numeric values to text.


In [None]:
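def int_to_text(ints):
  try:
    ints = ints.numpy()  # convert a TensorFlow tensor to a NumPy array
  except AttributeError:
    pass  # already a NumPy array or a plain sequence
  return ''.join(idx2char[ints])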

print(int_to_text(text_as_int[:13]))

First Citizen


### Creating Training Examples
Remember our task is to feed the model a sequence and have it return the next character. This means we need to split our text data from above into many shorter sequences that we can pass to the model as training examples.

The training examples we will prepare use a *seq_length* sequence as input and a *seq_length* sequence as output, where the output is the input shifted one character to the right. For example:

```input: Hell | output: ello```

Our first step will be to create a stream of characters from our text data.

In [None]:
seq_length = 100  # length of sequence for a training example
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

Next we can use the batch method to turn this stream of characters into batches of desired length.

In [None]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
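
What `batch(seq_length+1, drop_remainder=True)` does to the character stream can be mimicked in plain Python. The helper `fixed_length_chunks` below is a hypothetical stand-in written for illustration. It works on an ordinary string or list, not on a `tf.data.Dataset`.

```python
def fixed_length_chunks(items, size):
    """Split items into consecutive chunks of `size`, dropping any short tail."""
    n_full = len(items) // size
    return [items[i * size:(i + 1) * size] for i in range(n_full)]

print(fixed_length_chunks("First Citizen", 5))  # ['First', ' Citi']
print(1115394 // 101)  # 11043 full sequences in the Shakespeare text
```

The trailing `zen` is dropped because it does not fill a whole chunk. In the same way, the 1,115,394 characters of the text yield 11,043 sequences of length 101, and the last 51 characters are discarded.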

Now we need to use these sequences of length 101 and split them into input and output.

In [None]:
def split_input_target(chunk):  # for the example: hello
    input_text = chunk[:-1]  # hell
    target_text = chunk[1:]  # ello
    return input_text, target_text  # hell, ello

dataset = sequences.map(split_input_target)  # we use map to apply the above function to every entry

In [None]:
for x, y in dataset.take(2):
  print("\n\nEXAMPLE\n")
  print("INPUT")
  print(int_to_text(x))
  print("\nOUTPUT")
  print(int_to_text(y))



EXAMPLE

INPUT
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You

OUTPUT
irst Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You 


EXAMPLE

INPUT
are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you 

OUTPUT
re all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you k


Finally we need to make training batches.

In [None]:
BATCH_SIZE = 64
VOCAB_SIZE = len(vocab)  # vocab is number of unique characters
EMBEDDING_DIM = 256
RNN_UNITS = 1024

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
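
The buffer-based shuffling described in the comment above can be illustrated with a small generator. This is a simplified sketch of the idea (fill a buffer, emit a random element, refill the freed slot), not TensorFlow's actual implementation. `buffered_shuffle` and its `seed` parameter are assumptions made for this example.

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    """Yield the items of `stream` in a locally shuffled order."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a randomly chosen buffered element...
        buffer[idx] = item       # ...and put the new item into its slot
    # stream exhausted: emit whatever is left, in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4))
print(sorted(shuffled) == list(range(10)))  # True
```

Every element is emitted exactly once, but an element can only move as far as the buffer allows. With `buffer_size=1` the order is unchanged, while a buffer at least as large as the dataset gives a full shuffle.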

### Building the Model
Now it is time to build the model. We will use an embedding layer, an LSTM, and one dense layer that contains a node for each unique character in our training data. The dense layer outputs one score (logit) per character, which can be turned into a probability distribution over all characters.

In [None]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

model = build_model(VOCAB_SIZE,EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (64, None, 256)           16640     
                                                                 
 lstm (LSTM)                 (64, None, 1024)          5246976   
                                                                 
 dense (Dense)               (64, None, 65)            66625     
                                                                 
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________


### Creating a Loss Function
Now we are going to create our own loss function for this problem. This is because our model outputs a (64, sequence_length, 65) shaped tensor that holds, for every sequence in the batch and every timestep, one unnormalized score (logit) per character, from which a probability distribution over the next character is derived.



However, before we do that let's have a look at a sample input and the output from our untrained model. This is so we can understand what the model is giving us.



In [None]:
for input_example_batch, target_example_batch in data.take(1):
  example_batch_predictions = model(input_example_batch)  # ask our model for a prediction on our first batch of training data (64 entries)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")  # print out the output shape

(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [None]:
# we can see that the prediction is an array of 64 arrays, one for each entry in the batch
print(len(example_batch_predictions))
print(example_batch_predictions)

64
tf.Tensor(
[[[ 3.56389582e-03  3.84643790e-03  4.36210865e-03 ...  2.09503016e-03
    1.03878300e-03 -2.43377802e-03]
  [ 3.82107962e-03  8.01369548e-03  8.00206885e-03 ...  4.06005094e-03
    1.92892144e-03 -5.41026331e-03]
  [ 2.87781702e-03  5.13960654e-03  6.40344806e-03 ...  8.36855453e-03
    2.48497957e-03 -7.00188335e-03]
  ...
  [ 3.79926711e-03  3.31408950e-03  3.17737786e-03 ... -8.15140735e-03
    3.38351820e-04 -3.57806520e-03]
  [ 2.95986910e-03  5.62549941e-03 -8.47559422e-04 ... -2.53885658e-03
    3.69712873e-03 -4.19974327e-03]
  [ 5.98701974e-03  6.93688728e-03  2.49896268e-03 ... -9.79168341e-04
    3.23791686e-03 -5.31033473e-03]]

 [[-1.76456451e-04  2.56767194e-03  6.72089774e-03 ... -3.97190638e-03
   -3.17759556e-03  1.21498865e-03]
  [-3.23049433e-04  6.27466710e-03  2.73717334e-03 ...  1.09674083e-03
    2.49874219e-03 -1.19576510e-03]
  [ 2.43342388e-03  2.81624240e-03 -3.37624713e-03 ...  1.14233990e-03
   -4.35002474e-03 -5.08997682e-03]
  ...
  [ 4.961

In [None]:
# lets examine one prediction
pred = example_batch_predictions[0]
print(len(pred))
print(pred)
# notice this is a 2d array of length 100, where each interior array is the prediction for the next character at each time step

100
tf.Tensor(
[[ 0.0035639   0.00384644  0.00436211 ...  0.00209503  0.00103878
  -0.00243378]
 [ 0.00382108  0.0080137   0.00800207 ...  0.00406005  0.00192892
  -0.00541026]
 [ 0.00287782  0.00513961  0.00640345 ...  0.00836855  0.00248498
  -0.00700188]
 ...
 [ 0.00379927  0.00331409  0.00317738 ... -0.00815141  0.00033835
  -0.00357807]
 [ 0.00295987  0.0056255  -0.00084756 ... -0.00253886  0.00369713
  -0.00419974]
 [ 0.00598702  0.00693689  0.00249896 ... -0.00097917  0.00323792
  -0.00531033]], shape=(100, 65), dtype=float32)


In [None]:
# and finally we'll look at a prediction at the first timestep
time_pred = pred[0]
print(len(time_pred))
print(time_pred)
# and of course it's 65 values, one score (logit) per character, indicating how likely each is to occur next

In [None]:
# If we want to determine the predicted character we need to sample the output distribution (pick a value based on probability)
sampled_indices = tf.random.categorical(pred, num_samples=1)

# now we can reshape that array and convert all the integers to characters
sampled_indices = np.reshape(sampled_indices, (1, -1))[0]
predicted_chars = int_to_text(sampled_indices)

predicted_chars  # and this is what the model predicted for training sequence 1

"'-3UxHf;jOCwuqAgEyq'?G::?,typwLO-GnSQuW\nIJCVI!K:ovVMc!a.uU-yo$fk&OVDffOQPo.$hdtbZs&TRrRq!CSqUdC3XROD"

So now we need to create a loss function that can compare that output to the expected output and give us some numeric value representing how close the two were. 

In [None]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
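
To see what this loss computes for a single timestep, here is a pure-Python sketch of sparse categorical cross-entropy on raw logits. `sparse_ce_from_logits` is a hypothetical helper for illustration. The Keras function does the same calculation for every timestep of every sequence in the batch.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy of one integer label against one vector of raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]  # equals -log(softmax(logits)[label])

# An untrained model emits near-identical logits for all 65 characters
uniform_loss = sparse_ce_from_logits(label=0, logits=[0.0] * 65)
print(round(uniform_loss, 3))  # 4.174
```

A model that treats all 65 characters as equally likely scores ln(65) ≈ 4.174, so a loss near that value at the start of training is expected. Training should push it well below this baseline.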

### Compiling the Model
At this point we can think of our problem as a classification problem where the model predicts the probability of each unique character coming next.


In [None]:
model.compile(optimizer='adam', loss=loss)

### Creating Checkpoints
Now we are going to set up and configure our model to save checkpoints as it trains. This will allow us to load our model from a checkpoint and continue training it.

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

### Training
Finally, we will start training the model. 

**If this is taking a while go to Runtime > Change Runtime Type and choose "GPU" under hardware accelerator.**



In [None]:
history = model.fit(data, epochs=1, callbacks=[checkpoint_callback])



### Loading the Model
We'll rebuild the model from a checkpoint using a batch_size of 1 so that we can feed one piece of text to the model and have it make a prediction.

In [None]:
model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size=1)

Once the model has finished training, we can find the **latest checkpoint** that stores the model's weights using the following line.



In [None]:
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

We can load **any checkpoint** we want by specifying the exact file to load.

In [None]:
checkpoint_num = 10  # must be an epoch for which a checkpoint was actually saved
# load_weights expects the checkpoint path itself, not a checkpoint reader object
model.load_weights("./training_checkpoints/ckpt_" + str(checkpoint_num))
model.build(tf.TensorShape([1, None]))

### Generating Text
Now we can use the lovely function provided by the TensorFlow tutorial to generate some text using any starting string we'd like.

In [None]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 800

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
    
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
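
The effect of `temperature` is easier to see on a tiny example. The sketch below is plain Python rather than TensorFlow and uses three made-up logits. It applies the same "divide the logits by the temperature" step as `generate_text` and then converts the result to probabilities with a softmax. `softmax_with_temperature` is a hypothetical helper.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after scaling them by 1/temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature most of the probability goes to the highest-scoring character, so sampling becomes predictable. At a high temperature the distribution flattens and less likely characters are sampled more often.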

In [None]:
inp = input("Type a starting string: ")
print(generate_text(model, inp))

Type a starting string: hello world what a ifefekksbg
hello world what a ifefekksbgates.
Forse, my youthery beyom on.
I,, therewere muctle, my londs
Of was no dourt trields op Lifut, your greature of jught my his lines
Tran aff choughteress spracked, by that frilfer. Gullan.

First Serviord:
Which in of the erech,
Is mouse which all her, what more be, and great more commer,
's faiture merryess kins.

SICANPI:
Fee, comply:
For thee keach perave good heald.

Persen,
Your forewly: ell, infore lace, nerchire. foicy
the thee thou are brookes prate,
For the pirgares hopsesces misemes and howein
That carmed.

PETENGUS:
My would thou maves an hearind and agciser,
Hen, I no Dook morous noghe, from not
Kich hairsuld man me, net a vertious, what toos
Ke wornd the lost.

COMIZLAAD:
What, craintle more and exs.

HORTENSIO:
My courses.

Word mencher:
Stood monegt, and did now in eyein


And that's pretty much it for this module! I highly recommend experimenting with the model we just created and seeing what you can get it to do!