<a href="https://colab.research.google.com/github/pragyasingh1729/DeepLearningProjects/blob/main/LSTM_wordNextPredictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Artificial Neural Network (ANN)

So far, we have been working with Artificial Neural Networks (ANNs), computational models loosely inspired by the workings of the human brain. An ANN consists of layers of interconnected nodes, known as neurons, which work together to process and interpret complex data.

Each neuron in the network functions as a mathematical unit that receives input, processes it, and passes the output to the next layer of neurons. This structure typically includes an input layer, one or more hidden layers, and an output layer.

Through a process called training, where the weights of the connections between neurons are adjusted, ANNs can learn from data, recognize patterns, and make predictions. This capability makes them invaluable in a wide range of applications, such as image and speech recognition, natural language processing, and predictive analytics, enabling advancements in technology and artificial intelligence.
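
To make the layer structure concrete, here is a minimal NumPy sketch of a forward pass through one hidden layer and one output layer. The layer sizes, the ReLU and softmax activations, and the random weights are assumptions chosen for illustration; training (adjusting `W1`, `W2`, `b1`, `b2`) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: 4 input features, 5 hidden neurons, 3 output classes.
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)

def forward(x):
    hidden = relu(x @ W1 + b1)        # input layer -> hidden layer
    return softmax(hidden @ W2 + b2)  # hidden layer -> output probabilities

x = rng.normal(size=(2, 4))  # a batch of 2 examples
probs = forward(x)
print(probs.shape)
```

Each row of `probs` sums to 1, so it can be read as a probability distribution over the three classes.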


## Recurrent Neural Network

### Why do we need RNN?
- An ANN expects a fixed number of inputs and outputs, while sequences such as sentences vary in length.
- What if we fix the input size by padding? That leads to the next problem: an ANN has no built-in notion of the order of the elements in a sequence.

### Simple understanding of RNN
RNNs have a recurrent connection that allows them to maintain a memory of previous inputs, making them well-suited for tasks involving sequential data such as time series prediction, natural language processing, and speech recognition.

Here is an intuitive way to think about RNNs. Imagine you're reading a story, and you need to understand each sentence based on what you've read so far. That's what an RNN does: it reads data one part at a time and remembers what it has seen before.

**How it works:**

1. **Taking Input**: Just like reading a story, an RNN takes data (like words or numbers) one step at a time.
   
2. **Remembering**: As it reads each part, it keeps a memory of what it's seen so far. This memory helps it understand the current part in the context of what came before.

3. **Predicting**: Once it's read all the parts, it can make predictions or decisions based on what it's learned from the whole sequence.

**Example:**
Imagine you're predicting the next word in a sentence. With each word it reads, the RNN updates its memory to understand what might come next. So if it has read "The cat is", it might predict "sleeping" as the next word because, in the data it learned from, "The cat is" is often followed by "sleeping".

**In Summary:**
An RNN is like a smart reader that understands not just the current part of the story, but how it fits into the whole tale.
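
The read-and-remember loop described above can be sketched in a few lines of NumPy. This is a simple (Elman-style) RNN, not the Keras implementation; the weight names `W_xh`, `W_hh`, `b_h` and all sizes are assumptions made for the example.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])  # the "memory" starts empty
    states = []
    for x_t in inputs:
        # The same weights are reused at every time step;
        # the previous hidden state h feeds into the new one.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 3, 4, 5
W_xh = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

sequence = rng.normal(size=(steps, input_dim))
states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(states.shape)  # (5, 4)

# Feeding the same inputs in reverse order gives a different final state:
reversed_states = rnn_forward(sequence[::-1], W_xh, W_hh, b_h)
print(np.allclose(states[-1], reversed_states[-1]))
```

Because the hidden state carries information forward, the order of the inputs changes the result, which is exactly what a plain ANN on padded input lacks.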

"Recurrent" in RNNs highlights their ability to process sequential data by repeatedly applying the same computation across different time steps, incorporating information from previous steps into the current computation.

### Different mapping
In sequence problems, how inputs map to outputs matters a lot. Here are three different input-output mappings, each with an example:
- **one to many** - write a caption for an input image
- **many to one** - take review of a movie as an input and give rating as output
- **many to many** - language translation
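
As a rough sketch, the difference between many-to-one and many-to-many comes down to which hidden states are turned into outputs. The array `hidden_states` below is a random stand-in for what an RNN would produce, and `W_out` is a hypothetical output projection; real translation models usually use an encoder-decoder setup rather than this step-aligned form.

```python
import numpy as np

rng = np.random.default_rng(1)
steps, hidden_dim, num_outputs = 6, 4, 2

# Stand-in for the hidden states of an RNN, one row per time step.
hidden_states = rng.normal(size=(steps, hidden_dim))
W_out = rng.normal(size=(hidden_dim, num_outputs))

# Many to one: only the last hidden state produces an output.
many_to_one = hidden_states[-1] @ W_out

# Many to many: every time step produces its own output.
many_to_many = hidden_states @ W_out

print(many_to_one.shape, many_to_many.shape)  # (2,) (6, 2)
```

One to many works the other way around: a single input sets the initial state, and each generated output is fed back in as the next input.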

### Long Short Term Memory
- **RNNs and Vanishing Gradient Problem**:
  - RNNs (Recurrent Neural Networks) struggle with long sequential data due to the vanishing gradient problem, where gradients become too small, hindering learning over long time steps.

- **LSTM Architecture**:
  - LSTM (Long Short-Term Memory) networks address this issue with a unique architecture that maintains long-term and short-term memory.
  - This is achieved using three gates that regulate the flow of information:

    - **Forget Gate**: Decides what information to discard from the long-term state. It uses a sigmoid activation function to produce a value between 0 and 1, where 0 means "completely forget" and 1 means "completely keep."
  
    - **Input Gate**: Determines what new information to add to the cell state. It consists of a sigmoid layer (deciding which values to update) and a tanh layer (creating new candidate values to add).
  
    - **Output Gate**: Determines the new hidden state, which is used for predictions and passed to the next time step. A sigmoid layer, applied to the current input and the previous hidden state, decides which parts of the cell state to expose; its output is multiplied by the tanh of the cell state.

- **Additional Points**:
  - **Cell State**: The cell state runs through the entire chain with only minor linear interactions, allowing the network to carry information across many time steps without significant loss.
  - **Hidden State**: The hidden state is updated at each time step and serves as a short-term memory that influences the cell state and the output.
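
The three gates and the two states can be written as a single time step in NumPy. This is a minimal sketch of the standard LSTM equations, not the Keras internals: the stacking order `[forget, input, candidate, output]` inside `W`, `U`, `b` is an arbitrary choice for this example, and the weights are random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack four blocks: forget, input, candidate, output."""
    n = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    f = sigmoid(z[0 * n:1 * n])   # forget gate: how much of the old cell state to keep
    i = sigmoid(z[1 * n:2 * n])   # input gate: which candidate values to write
    g = np.tanh(z[2 * n:3 * n])   # candidate values to add to the cell state
    o = sigmoid(z[3 * n:4 * n])   # output gate: which parts of the cell state to expose
    c = f * c_prev + i * g        # new cell state (long-term memory)
    h = o * np.tanh(c)            # new hidden state (short-term memory)
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = rng.normal(scale=0.5, size=(input_dim, 4 * hidden_dim))
U = rng.normal(scale=0.5, size=(hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Note that the cell state is updated only by an element-wise multiply and an add, which is what lets information survive across many time steps.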



# <font color="red"><b>Working with LSTM</b></font>



In [3]:
faqs = """About the Program
What is the course fee for  Data Science Mentorship Program (DSMP 2023)
The course follows a monthly subscription model where you have to make monthly payments of Rs 799/month.
What is the total duration of the course?
The total duration of the course is 7 months. So the total course fee becomes 799*7 = Rs 5600(approx.)
What is the syllabus of the mentorship program?
We will be covering the following modules:
Python Fundamentals
Python libraries for Data Science
Data Analysis
SQL for Data Science
Maths for Machine Learning
ML Algorithms
Practical ML
MLOPs
Case studies
You can check the detailed syllabus here - https://learnwith.campusx.in/courses/CampusX-Data-Science-Mentorship-Program-637339afe4b0615a1bbed390
Will Deep Learning and NLP be a part of this program?
No, NLP and Deep Learning both are not a part of this program’s curriculum.
What if I miss a live session? Will I get a recording of the session?
Yes all our sessions are recorded, so even if you miss a session you can go back and watch the recording.
Where can I find the class schedule?
Checkout this google sheet to see month by month time table of the course - https://docs.google.com/spreadsheets/d/16OoTax_A6ORAeCg4emgexhqqPv3noQPYKU7RJ6ArOzk/edit?usp=sharing.
What is the time duration of all the live sessions?
Roughly, all the sessions last 2 hours.
What is the language spoken by the instructor during the sessions?
Hinglish
How will I be informed about the upcoming class?
You will get a mail from our side before every paid session once you become a paid user.
Can I do this course if I am from a non-tech background?
Yes, absolutely.
I am late, can I join the program in the middle?
Absolutely, you can join the program anytime.
If I join/pay in the middle, will I be able to see all the past lectures?
Yes, once you make the payment you will be able to see all the past content in your dashboard.
Where do I have to submit the task?
You don’t have to submit the task. We will provide you with the solutions, you have to self evaluate the task yourself.
Will we do case studies in the program?
Yes.
Where can we contact you?
You can mail us at nitish.campusx@gmail.com
Payment/Registration related questions
Where do we have to make our payments? Your YouTube channel or website?
You have to make all your monthly payments on our website. Here is the link for our website - https://learnwith.campusx.in/
Can we pay the entire amount of Rs 5600 all at once?
Unfortunately no, the program follows a monthly subscription model.
What is the validity of monthly subscription? Suppose if I pay on 15th Jan, then do I have to pay again on 1st Feb or 15th Feb
15th Feb. The validity period is 30 days from the day you make the payment. So essentially you can join anytime you don’t have to wait for a month to end.
What if I don’t like the course after making the payment. What is the refund policy?
You get a 7 days refund period from the day you have made the payment.
I am living outside India and I am not able to make the payment on the website, what should I do?
You have to contact us by sending a mail at nitish.campusx@gmail.com
Post registration queries
Till when can I view the paid videos on the website?
This one is tricky, so read carefully. You can watch the videos till your subscription is valid. Suppose you have purchased subscription on 21st Jan, you will be able to watch all the past paid sessions in the period of 21st Jan to 20th Feb. But after 21st Feb you will have to purchase the subscription again.
But once the course is over and you have paid us Rs 5600(or 7 installments of Rs 799) you will be able to watch the paid sessions till Aug 2024.
Why lifetime validity is not provided?
Because of the low course fee.
Where can I reach out in case of a doubt after the session?
You will have to fill a google form provided in your dashboard and our team will contact you for a 1 on 1 doubt clearance session
If I join the program late, can I still ask past week doubts?
Yes, just select past week doubt in the doubt clearance google form.
I am living outside India and I am not able to make the payment on the website, what should I do?
You have to contact us by sending a mail at nitish.campusx@gmai.com
Certificate and Placement Assistance related queries
What is the criteria to get the certificate?
There are 2 criterias:
You have to pay the entire fee of Rs 5600
You have to attempt all the course assessments.
I am joining late. How can I pay payment of the earlier months?
You will get a link to pay fee of earlier months in your dashboard once you pay for the current month.
I have read that Placement assistance is a part of this program. What comes under Placement assistance?
This is to clarify that Placement assistance does not mean Placement guarantee. So we dont guarantee you any jobs or for that matter even interview calls. So if you are planning to join this course just for placements, I am afraid you will be disappointed. Here is what comes under placement assistance
Portfolio Building sessions
Soft skill sessions
Sessions with industry mentors
Discussion on Job hunting strategies
"""

## Steps to predict the next word using LSTM
- Tokenizer: converts the raw text into sequences of integers
  - first, build a vocabulary by tokenizing all the words in the given input
  - then use the vocabulary to convert each sentence into a list of tokens

- next, go word by word through each sentence to build the input (the words so far) and the target (the next word in the sentence)

- the inputs will vary in length, so use padding to bring them to the same length

- overall this is a `classification problem`, as the target is the integer index associated with a word in the vocabulary

### Tokenizer
It is designed to transform raw text into sequences of integers, which can then be fed into neural network models.
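
To see roughly what such a tokenizer does, here is a pure-Python sketch that lowercases the text, splits it into words, and assigns indices by frequency starting from 1. The helper names `build_word_index` and `text_to_sequence` and the regular expression are assumptions for illustration; the Keras `Tokenizer` handles punctuation with its own filter rules, so results on real text may differ in details.

```python
import re
from collections import Counter

def split_words(text):
    # Simplified word splitting: lowercase, keep letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(text):
    counts = Counter(split_words(text))
    # Most frequent word gets index 1; index 0 is left free for padding.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def text_to_sequence(text, word_index):
    # Words that are not in the vocabulary are silently dropped.
    return [word_index[w] for w in split_words(text) if w in word_index]

toy_text = "the cat sat on the mat. the cat slept."
word_index = build_word_index(toy_text)
print(word_index)
print(text_to_sequence("The cat sat", word_index))  # [1, 2, 3]
```

Here "the" appears most often and gets index 1, followed by "cat" with index 2.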

In [4]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer


In [5]:
tokenizer = Tokenizer()

In [6]:
tokenizer.fit_on_texts([faqs])

In [7]:
tokenizer.word_index

{'the': 1,
 'you': 2,
 'i': 3,
 'to': 4,
 'a': 5,
 'of': 6,
 'is': 7,
 'have': 8,
 'will': 9,
 'can': 10,
 'what': 11,
 'course': 12,
 'program': 13,
 'in': 14,
 'for': 15,
 'all': 16,
 'sessions': 17,
 'on': 18,
 'be': 19,
 'and': 20,
 'this': 21,
 'if': 22,
 'am': 23,
 'pay': 24,
 'payment': 25,
 'make': 26,
 'we': 27,
 'do': 28,
 'subscription': 29,
 'where': 30,
 'rs': 31,
 'so': 32,
 'campusx': 33,
 'session': 34,
 'our': 35,
 'paid': 36,
 'join': 37,
 'able': 38,
 'your': 39,
 'website': 40,
 'placement': 41,
 'fee': 42,
 'data': 43,
 'monthly': 44,
 'month': 45,
 'not': 46,
 'get': 47,
 'yes': 48,
 'once': 49,
 'past': 50,
 'feb': 51,
 'assistance': 52,
 'science': 53,
 '7': 54,
 '5600': 55,
 'are': 56,
 'watch': 57,
 'google': 58,
 'by': 59,
 'com': 60,
 'mail': 61,
 'from': 62,
 'contact': 63,
 'us': 64,
 'at': 65,
 'or': 66,
 'doubt': 67,
 'mentorship': 68,
 'payments': 69,
 '799': 70,
 'total': 71,
 'duration': 72,
 'months': 73,
 'learning': 74,
 'case': 75,
 'here': 76,
 ...}

In [8]:
input_sequence = []
for sentence in faqs.split('\n'):
  tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]

  for i in range(1, len(tokenized_sentence)):
    n_gram = tokenized_sentence[:i+1]
    input_sequence.append(n_gram)

In [9]:
input_sequence[:5]

[[93, 1], [93, 1, 13], [11, 7], [11, 7, 1], [11, 7, 1, 12]]

The sequences above have different lengths, so we pad them to a common length.

In [10]:
max_length = max(len(x) for x in input_sequence)

In [11]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [12]:
padded_input_sequences = pad_sequences(input_sequence, maxlen = max_length, padding = 'pre')

In [13]:
X = padded_input_sequences[:, :-1]
Y = padded_input_sequences[:,-1]

In [14]:
print(X.shape, Y.shape)

(863, 56) (863,)
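
The prefix, pad, and split steps can be traced by hand on one short sentence. The sketch below uses plain Python lists instead of `pad_sequences`; `make_prefixes` and `pad_pre` are hypothetical helpers written for this example, and unlike `pad_sequences` they do not truncate sequences that are too long.

```python
def make_prefixes(tokens):
    # Every prefix of length >= 2: the last token is the target, the rest is the input.
    return [tokens[:i + 1] for i in range(1, len(tokens))]

def pad_pre(seq, length, value=0):
    # Left-pad with zeros so that the target always sits in the last position.
    return [value] * (length - len(seq)) + seq

tokens = [11, 7, 1, 12]  # a toy tokenized sentence
prefixes = make_prefixes(tokens)
max_len = max(len(p) for p in prefixes)
padded = [pad_pre(p, max_len) for p in prefixes]

X = [row[:-1] for row in padded]  # everything except the last token
y = [row[-1] for row in padded]   # the last token is the next word to predict
print(X)  # [[0, 0, 11], [0, 11, 7], [11, 7, 1]]
print(y)  # [7, 1, 12]
```

Pre-padding matters here: with padding on the left, slicing off the last column always yields the real next word rather than a padding zero.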


Since this is a multi-class classification problem, we want the probability of each word in the vocabulary being the next word in the sentence. To do so, we one-hot encode the targets over the vocabulary.

In [15]:
from tensorflow.keras.utils import to_categorical

In [21]:
num_classes = len(tokenizer.word_index)
print(num_classes)

282


In [19]:
# Tokenizer indices start from 1, but one-hot arrays are zero-indexed (index 0 is the padding value).
# Using num_classes + 1 columns ensures every index from 0 to the maximum word index is covered.
Y = to_categorical(Y, num_classes + 1)

In [20]:
Y.shape

(863, 283)
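
The need for the extra column is easy to see on a toy vocabulary. The `one_hot` function below is a NumPy sketch written for illustration, not the `to_categorical` implementation; the vocabulary size and targets are made up.

```python
import numpy as np

def one_hot(indices, num_classes):
    indices = np.asarray(indices)
    out = np.zeros((indices.size, num_classes), dtype="float32")
    out[np.arange(indices.size), indices] = 1.0  # set one column per row
    return out

vocab_size = 3          # toy vocabulary with word indices 1, 2, 3
targets = [1, 3, 2]
encoded = one_hot(targets, vocab_size + 1)
print(encoded.shape)  # (3, 4)
print(encoded)
```

With only `vocab_size` columns, index 3 would fall outside the array; column 0 stays all zeros because no real word has index 0.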

### Model

We use the Embedding layer to convert the integer-encoded sequences (X) into dense vectors. The Embedding layer maps each word index to a dense, trainable vector that can capture semantic relationships; compared with one-hot vectors, this reduces dimensionality and makes NLP models more efficient and effective.
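
Conceptually, an embedding layer is a lookup table: row `k` of a weight matrix is the vector for word index `k`. The NumPy sketch below uses a small random matrix as a stand-in for trained weights and shows that the lookup gives the same result as multiplying a one-hot matrix by the table, without ever building the one-hot matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 3
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))  # stand-in for trained weights

token_ids = np.array([0, 0, 4, 2])       # a pre-padded toy sequence
looked_up = embedding_matrix[token_ids]  # one row per token

# Same result via an explicit one-hot matrix multiplication.
one_hot_rows = np.eye(vocab_size)[token_ids]
via_matmul = one_hot_rows @ embedding_matrix

print(looked_up.shape)                     # (4, 3)
print(np.allclose(looked_up, via_matmul))  # True
```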








In [28]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [24]:
# Embedding(283, 100): vocabulary size 283 (282 words + padding index 0); each word is represented as a 100-dimensional vector.
# input_length=56: each input sequence has 56 tokens.

model = Sequential([
    Embedding(283, 100, input_length = 56),
    LSTM(150), # 150 LSTM units (size of the hidden and cell states)
    Dense(283, activation = 'softmax') # one probability per vocabulary index
])

In [29]:
model.compile(loss = 'categorical_crossentropy', optimizer = 'Adam', metrics = ['accuracy'])

In [30]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 56, 100)           28300     
                                                                 
 lstm (LSTM)                 (None, 150)               150600    
                                                                 
 dense (Dense)               (None, 283)               42733     
                                                                 
Total params: 221633 (865.75 KB)
Trainable params: 221633 (865.75 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
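
The parameter counts in the summary can be reproduced by hand. The arithmetic below assumes the standard LSTM layout: four weight blocks (three gates plus the candidate), each with an input kernel, a recurrent kernel, and a bias.

```python
vocab_size = 283   # 282 words + padding index 0
embed_dim = 100
lstm_units = 150

# Embedding: one 100-dimensional vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: 4 blocks, each with (input + recurrent) weights and a bias per unit.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weights from 150 units to 283 outputs, plus one bias per output.
dense_params = lstm_units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 28300 150600 42733 221633
```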


In [32]:
model.fit(X, Y, epochs = 500, verbose = 0)

<keras.src.callbacks.History at 0x7adc92123310>

In [37]:
import numpy as np

## Check the model

In [39]:
text = 'mail'

## tokenize
token_text = tokenizer.texts_to_sequences([text])[0]
## padding
padded_token = pad_sequences([token_text], maxlen = 56, padding = 'pre')

## predict returns an array of shape (1, 283): a probability for each vocabulary index that can come next in the sentence
position = np.argmax(model.predict(padded_token))



In [40]:
position

21

In [41]:
for word, index in tokenizer.word_index.items():
  if index == position:
    print(word)

this
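
Looping over the whole vocabulary for every lookup works, but an inverted dictionary does the same job in one step. The `word_index` below is a three-word toy stand-in for `tokenizer.word_index`; the `"<pad>"` label for index 0 is an assumption made for this sketch.

```python
word_index = {"the": 1, "you": 2, "i": 3}  # toy stand-in for tokenizer.word_index

# Invert the mapping once: index -> word.
index_word = {index: word for word, index in word_index.items()}

print(index_word[2])               # you
print(index_word.get(0, "<pad>"))  # <pad>
```

The Keras `Tokenizer` also keeps such a mapping in its `index_word` attribute, so `tokenizer.index_word[position]` should give the same word as the loop.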


## Generating a sentence from a seed word

In [45]:
text = 'mail'

for i in range(10):

  token_text = tokenizer.texts_to_sequences([text])[0]

  padded_token = pad_sequences([token_text], maxlen = 56, padding = 'pre')

  position = np.argmax(model.predict(padded_token))

  for word, index in tokenizer.word_index.items():
    if index == position:
      text = text + ' ' + word
      print(text)

mail this
mail this google
mail this google sheet
mail this google sheet to
mail this google sheet to see
mail this google sheet to see month
mail this google sheet to see month by
mail this google sheet to see month by month
mail this google sheet to see month by month time
mail this google sheet to see month by month time table
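
The loop above is a form of greedy decoding: at each step it appends the single most probable next word. As a sketch, the same idea can be wrapped in a reusable function. Everything here is hypothetical: `generate` and its parameters are made-up names, and `toy_predict` is a stand-in for `model.predict` that follows a fixed rule so the example runs without a trained model.

```python
import numpy as np

def generate(seed_ids, predict_fn, index_word, steps, max_len):
    """Greedy decoding: repeatedly append the most probable next token."""
    ids = list(seed_ids)
    for _ in range(steps):
        window = ids[-max_len:]                          # keep at most max_len tokens
        padded = [0] * (max_len - len(window)) + window  # pre-pad with zeros
        probs = predict_fn(np.array([padded]))[0]
        next_id = int(np.argmax(probs))
        if next_id == 0:  # the padding index has no word, so stop
            break
        ids.append(next_id)
    return " ".join(index_word[i] for i in ids)

# Toy stand-in for a trained model: always predicts (last token % 3) + 1.
index_word = {1: "a", 2: "b", 3: "c"}

def toy_predict(batch):
    last = batch[:, -1]
    probs = np.zeros((len(batch), 4))
    probs[np.arange(len(batch)), last % 3 + 1] = 1.0
    return probs

result = generate([1], toy_predict, index_word, steps=4, max_len=5)
print(result)  # a b c a b
```

With the real model, `predict_fn` would be `model.predict`, `max_len` would be 56, and `index_word` would come from the tokenizer.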
