
In [2]:
!pip install nltk




In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
from torch.utils.data import Dataset, DataLoader
from nltk.tokenize import word_tokenize
import nltk

In [4]:
document = """About the Program
What is the course fee for the AI & Analytics Career Accelerator (AACA 2024)?
The program is billed quarterly. Each quarter costs USD 149, so the full 12-month journey is 149 × 4 = USD 596 (approx.).

What is the total duration of the course?
12 months, split into four themed quarters.

What is the syllabus of the accelerator?
Quarter 1 – Python & Jupyter Essentials
Quarter 2 – Data Wrangling, EDA & Storytelling
Quarter 3 – Applied Statistics & Predictive Modeling
Quarter 4 – Deploying Models, Cloud Pipelines & Business Cases
Detailed week-by-week curriculum: https://skillnova.com/aaca-syllabus

Will Deep Learning, NLP or Computer Vision be covered?
Yes, all three are introduced in Quarter 4 with hands-on labs.

What if I miss a live workshop?
Every session is auto-recorded and uploaded within 2 hours; you can watch at 1.25× or download audio for offline listening.

Where can I find the class calendar?
Live calendar (updated daily): https://calendar.skillnova.com/aaca2024

How long is each live workshop?
90 minutes on average, plus 30 minutes open Q&A.

What language is used?
English only; subtitles in Spanish and Portuguese are provided within 24 h.

How will I know a session is coming?
Telegram bot + email reminder 24 h and 1 h before every live event.

Can I enroll without a tech background?
Yes, a 2-week “Zero-to-Code” pre-bootcamp is included free.

I’m late—can I join mid-quarter?
Yes, you can start on the 1st or 15th of any month; you’ll receive links to catch-up material immediately.

If I join late, do I get access to earlier quarters?
Yes, the moment you pay you unlock all previously released content.

Where do I submit assignments?
Inside the SkillNova portal; you receive instant auto-grading + expert feedback within 48 h.

Are real company case-studies used?
Yes, each quarter ends with a 1-week capstone sourced from partner firms (fintech, e-commerce, logistics).

How do I reach support?
Email: hello@skillnova.com or Discord channel (average response 11 minutes).

Payment / Registration
Where do I pay?
Only on our secure portal: https://skillnova.com/checkout

Can I pay the full USD 596 upfront?
Yes, and you save 10 % (final price USD 536).

What is the validity of each quarterly payment?
90 days from the day you pay, not calendar quarters.

Refund policy?
14-day no-questions-asked refund for each quarter.

I live in a country without Stripe—what now?
Message us on Discord; we accept Wise, PayPal or local bank transfer.

Post-registration
How long can I re-watch videos?
While your quarterly license is active you have unlimited replays. After you complete all four payments you receive lifetime access to the 2024 edition recordings.

Why not lifetime from day one?
Continuous cloud GPU labs and updated datasets incur ongoing cost.

Where do I ask doubts?
Built-in “Raise Hand” button; mentors hold daily 60 min open rooms at 07:00 and 19:00 UTC.

Can I ask questions about old modules?
Yes, simply tag the week number when you post.

Certificate & Career Services
Certificate requirements?
1) Finish all four quarterly payments.
2) Score ≥ 70 % on each capstone.
3) Complete the career-portfolio track (résumé + GitHub audit).

Placement assistance contents
- 6 live résumé clinics
- Mock interviews with FAANG volunteers
- Curated job-board access (scraped daily)
- Salary-negotiation workshop
Note: this is assistance, not a job guarantee. No interviews are promised.
"""

In [5]:
# Download the Punkt tokenizer data that word_tokenize depends on
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [6]:
tokens = word_tokenize(document.lower())

In [7]:
# build vocab: index 0 is reserved for unknown words
vocab = {'<unk>':0}

for token in Counter(tokens).keys():
  if token not in vocab:
    vocab[token] = len(vocab)

vocab

{'<unk>': 0,
 'about': 1,
 'the': 2,
 'program': 3,
 'what': 4,
 'is': 5,
 'course': 6,
 'fee': 7,
 'for': 8,
 'ai': 9,
 '&': 10,
 'analytics': 11,
 'career': 12,
 'accelerator': 13,
 '(': 14,
 'aaca': 15,
 '2024': 16,
 ')': 17,
 '?': 18,
 'billed': 19,
 'quarterly': 20,
 '.': 21,
 'each': 22,
 'quarter': 23,
 'costs': 24,
 'usd': 25,
 '149': 26,
 ',': 27,
 'so': 28,
 'full': 29,
 '12-month': 30,
 'journey': 31,
 '×': 32,
 '4': 33,
 '=': 34,
 '596': 35,
 'approx.': 36,
 'total': 37,
 'duration': 38,
 'of': 39,
 '12': 40,
 'months': 41,
 'split': 42,
 'into': 43,
 'four': 44,
 'themed': 45,
 'quarters': 46,
 'syllabus': 47,
 '1': 48,
 '–': 49,
 'python': 50,
 'jupyter': 51,
 'essentials': 52,
 '2': 53,
 'data': 54,
 'wrangling': 55,
 'eda': 56,
 'storytelling': 57,
 '3': 58,
 'applied': 59,
 'statistics': 60,
 'predictive': 61,
 'modeling': 62,
 'deploying': 63,
 'models': 64,
 'cloud': 65,
 'pipelines': 66,
 'business': 67,
 'cases': 68,
 'detailed': 69,
 'week-by-week': 70,
 'curricul

In [8]:
len(vocab)

336
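
The prediction step further down has to map a predicted index back to a word. A minimal sketch of an inverse lookup table, using a hypothetical three-word vocabulary rather than the notebook's `vocab`; the names `toy_vocab` and `index_to_word` are introduced here for illustration only:

```python
# Hypothetical toy vocabulary (not the notebook's 336-entry vocab)
toy_vocab = {'<unk>': 0, 'about': 1, 'the': 2}

# Invert the word -> index mapping once, so a lookup by index is a dict access
index_to_word = {index: word for word, index in toy_vocab.items()}

print(index_to_word[2])
```

Building the inverse dictionary once avoids rebuilding a list of keys on every prediction call.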

In [16]:
input_sentences = document.split('\n')

In [17]:
def text_to_indices(sentence, vocab):
  """Map a list of tokens to vocab indices; tokens not in vocab map to '<unk>'."""
  return [vocab.get(token, vocab['<unk>']) for token in sentence]

In [18]:
input_numerical_sentences = []

for sentence in input_sentences:
  input_numerical_sentences.append(text_to_indices(word_tokenize(sentence.lower()), vocab))

In [19]:
len(input_numerical_sentences)

92

In [20]:
training_sequence = []
for sentence in input_numerical_sentences:

  for i in range(1, len(sentence)):
    training_sequence.append(sentence[:i+1])

In [21]:
len(training_sequence)

606

In [22]:
training_sequence[:5]

[[1, 2], [1, 2, 3], [4, 5], [4, 5, 2], [4, 5, 2, 6]]
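
Each sentence is expanded into all of its prefixes of length two or more, and the last index of every prefix later becomes the prediction target. A small sketch of that expansion on a single sentence; `prefix_sequences` is a hypothetical helper name, not part of the notebook:

```python
def prefix_sequences(sentence):
    """Return every prefix of length >= 2; the last item of each prefix is the target."""
    return [sentence[:i + 1] for i in range(1, len(sentence))]

# [4, 5, 2, 6] are the indices of "what is the course" in the vocab above
print(prefix_sequences([4, 5, 2, 6]))
# → [[4, 5], [4, 5, 2], [4, 5, 2, 6]]
```

A sentence with fewer than two tokens yields no prefixes, which is why blank lines in the document contribute nothing to the training set.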

In [23]:
len_list = []

for sequence in training_sequence:
  len_list.append(len(sequence))

max(len_list)

29

In [24]:
# left-pad with 0 so every sequence has the length of the longest one
max_len = max(len_list)
padded_training_sequence = []
for sequence in training_sequence:
  padded_training_sequence.append([0] * (max_len - len(sequence)) + sequence)

In [25]:
len(padded_training_sequence[10])

29
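
Left-padding keeps the most recent tokens at the end of every row, next to the target. A minimal sketch on three short sequences; `left_pad` and `pad_index` are hypothetical names used only for this illustration:

```python
def left_pad(sequence, length, pad_index=0):
    """Prepend pad_index until the sequence has the requested length."""
    return [pad_index] * (length - len(sequence)) + sequence

sequences = [[1, 2], [1, 2, 3], [4, 5, 2, 6]]
max_len = max(len(s) for s in sequences)
padded = [left_pad(s, max_len) for s in sequences]

print(padded)
# → [[0, 0, 1, 2], [0, 1, 2, 3], [4, 5, 2, 6]]
```

Note that index 0 is also `<unk>` in this notebook's vocabulary, so the model sees padding and unknown words as the same token. A dedicated padding entry would be one way to separate them; that is a suggestion, not something the notebook does.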

In [26]:
padded_training_sequence = torch.tensor(padded_training_sequence, dtype=torch.long)

In [27]:
padded_training_sequence

tensor([[  0,   0,   0,  ...,   0,   1,   2],
        [  0,   0,   0,  ...,   1,   2,   3],
        [  0,   0,   0,  ...,   0,   4,   5],
        ...,
        [  0,   0,   0,  ..., 334, 323,  87],
        [  0,   0,   0,  ..., 323,  87, 335],
        [  0,   0,   0,  ...,  87, 335,  21]])

In [28]:
# inputs are all tokens but the last; the target is the last token of each sequence
X = padded_training_sequence[:, :-1]
y = padded_training_sequence[:, -1]

In [29]:
X

tensor([[  0,   0,   0,  ...,   0,   0,   1],
        [  0,   0,   0,  ...,   0,   1,   2],
        [  0,   0,   0,  ...,   0,   0,   4],
        ...,
        [  0,   0,   0,  ...,  21, 334, 323],
        [  0,   0,   0,  ..., 334, 323,  87],
        [  0,   0,   0,  ..., 323,  87, 335]])

In [30]:
y

tensor([  2,   3,   5,   2,   6,   7,   8,   2,   9,  10,  11,  12,  13,  14,
         15,  16,  17,  18,   3,   5,  19,  20,  21,  22,  23,  24,  25,  26,
         27,  28,   2,  29,  30,  31,   5,  26,  32,  33,  34,  25,  35,  14,
          0,  21,  17,  21,   5,   2,  37,  38,  39,   2,   6,  18,  41,  27,
         42,  43,  44,  45,  46,  21,   5,   2,  47,  39,   2,  13,  18,  48,
         49,  50,  10,  51,  52,  53,  49,  54,  55,  27,  56,  10,  57,  58,
         49,  59,  60,  10,  61,  62,  33,  49,  63,  64,  27,  65,  66,  10,
         67,  68,  70,  71,  72,  73,  72,  74,  76,  77,  27,  78,  79,  80,
         81,  82,  83,  18,  27,  85,  86,  87,  88,  89,  23,  33,  90,  91,
         92,  21,  93,  94,  95,  96,  97,  98,  18, 100,   5, 101, 102, 103,
        104,  53, 105, 106, 107, 108, 109, 110, 111,  79, 112, 113,   8, 114,
        115,  21, 108,  94, 117,   2, 118, 119,  18, 119,  14, 120, 121,  17,
         72,  73,  72, 122, 124,   5,  22,  97,  98,  18, 126, 1

In [31]:
class CustomDataset(Dataset):

  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __len__(self):
    return self.X.shape[0]

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]

In [32]:
dataset = CustomDataset(X,y)

In [33]:
len(dataset)

606

In [34]:
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

In [35]:
class LSTMModel(nn.Module):

  def __init__(self, vocab_size):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, 100)
    self.lstm = nn.LSTM(100, 150, batch_first=True)
    self.fc = nn.Linear(150, vocab_size)

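  def forward(self, x):
    embedded = self.embedding(x)  # (batch, seq_len, 100)
    # final_hidden_state has shape (num_layers=1, batch, 150)
    intermediate_hidden_states, (final_hidden_state, final_cell_state) = self.lstm(embedded)
    # drop the layer dimension and project to one score per vocabulary word
    output = self.fc(final_hidden_state.squeeze(0))
    return output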

In [36]:
model = LSTMModel(len(vocab))

In [37]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [38]:
model.to(device)

LSTMModel(
  (embedding): Embedding(336, 100)
  (lstm): LSTM(100, 150, batch_first=True)
  (fc): Linear(in_features=150, out_features=336, bias=True)
)
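
The `squeeze(0)` in `forward` relies on the shape of the final hidden state returned by `nn.LSTM`. A quick shape check, sketched with freshly initialized layers of the same sizes as the model above (the batch of zeros is a placeholder input, not real data):

```python
import torch
import torch.nn as nn

# Sizes mirror the notebook: 336 words, 100-dim embeddings, 150 hidden units
embedding = nn.Embedding(336, 100)
lstm = nn.LSTM(100, 150, batch_first=True)

batch = torch.zeros(32, 28, dtype=torch.long)  # 32 sequences of 28 indices
all_states, (h_n, c_n) = lstm(embedding(batch))

print(all_states.shape)       # one hidden state per time step: (32, 28, 150)
print(h_n.shape)              # (num_layers, batch, hidden): (1, 32, 150)
print(h_n.squeeze(0).shape)   # (32, 150), ready for the linear layer
```

Even with `batch_first=True`, the final hidden state keeps the layer dimension first. For a stacked, unidirectional LSTM, indexing with `h_n[-1]` instead of `squeeze(0)` would select the last layer.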

In [39]:
epochs = 50
learning_rate = 0.001

criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [40]:
# training loop (the printed loss is the sum of the per-batch mean losses in an epoch)

for epoch in range(epochs):
  total_loss = 0

  for batch_x, batch_y in dataloader:

    batch_x, batch_y = batch_x.to(device), batch_y.to(device)

    optimizer.zero_grad()

    output = model(batch_x)

    loss = criterion(output, batch_y)

    loss.backward()

    optimizer.step()

    total_loss = total_loss + loss.item()

  print(f"Epoch: {epoch + 1}, Loss: {total_loss:.4f}")

Epoch: 1, Loss: 110.1093
Epoch: 2, Loss: 103.8971
Epoch: 3, Loss: 97.1264
Epoch: 4, Loss: 90.2614
Epoch: 5, Loss: 83.5145
Epoch: 6, Loss: 76.3216
Epoch: 7, Loss: 69.2808
Epoch: 8, Loss: 62.6098
Epoch: 9, Loss: 55.9717
Epoch: 10, Loss: 49.9605
Epoch: 11, Loss: 44.1644
Epoch: 12, Loss: 39.1765
Epoch: 13, Loss: 34.5541
Epoch: 14, Loss: 30.4254
Epoch: 15, Loss: 26.7595
Epoch: 16, Loss: 23.3622
Epoch: 17, Loss: 20.5141
Epoch: 18, Loss: 18.1171
Epoch: 19, Loss: 15.9691
Epoch: 20, Loss: 14.1853
Epoch: 21, Loss: 12.6605
Epoch: 22, Loss: 11.3068
Epoch: 23, Loss: 10.2096
Epoch: 24, Loss: 9.3027
Epoch: 25, Loss: 8.5422
Epoch: 26, Loss: 7.8650
Epoch: 27, Loss: 7.2964
Epoch: 28, Loss: 6.7820
Epoch: 29, Loss: 6.3359
Epoch: 30, Loss: 6.0159
Epoch: 31, Loss: 5.6506
Epoch: 32, Loss: 5.3554
Epoch: 33, Loss: 5.1168
Epoch: 34, Loss: 4.8648
Epoch: 35, Loss: 4.6360
Epoch: 36, Loss: 4.4991
Epoch: 37, Loss: 4.2975
Epoch: 38, Loss: 4.1528
Epoch: 39, Loss: 4.0413
Epoch: 40, Loss: 3.8856
Epoch: 41, Loss: 3.8468
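
The logged values are sums over batches, so they are easier to interpret after dividing by the number of batches. A rough sanity check, assuming the default `reduction='mean'` of `nn.CrossEntropyLoss` and the dataset and batch sizes reported earlier in the notebook:

```python
import math

num_samples = 606   # len(dataset) reported earlier
batch_size = 32
vocab_size = 336

# The last batch is smaller, so the number of batches is rounded up
batches_per_epoch = math.ceil(num_samples / batch_size)

# Average of the per-batch mean losses in epoch 1
mean_loss_epoch_1 = 110.1093 / batches_per_epoch

# Cross-entropy of a uniform guess over the whole vocabulary
uniform_guess_loss = math.log(vocab_size)

print(batches_per_epoch)
print(round(mean_loss_epoch_1, 2))
print(round(uniform_guess_loss, 2))
```

The first-epoch average of about 5.80 sits just below the uniform-guess value of about 5.82, which is what an untrained model would be expected to produce. Because the final batch holds fewer samples, this average of batch means is only an approximation of the per-sample loss.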


In [41]:
# prediction

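def prediction(model, vocab, text):

  # tokenize
  tokenized_text = word_tokenize(text.lower())

  # text -> numerical indices
  numerical_text = text_to_indices(tokenized_text, vocab)

  # left-pad to the training input length (X.shape[1]), keeping the most recent tokens
  input_length = X.shape[1]
  numerical_text = numerical_text[-input_length:]
  padded_text = torch.tensor([0] * (input_length - len(numerical_text)) + numerical_text, dtype=torch.long).unsqueeze(0)

  # send to model on the same device as its weights, without tracking gradients
  model.eval()
  with torch.no_grad():
    output = model(padded_text.to(device))

  # predicted index
  value, index = torch.max(output, dim=1)

  # merge with text
  return text + " " + list(vocab.keys())[index.item()]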



In [43]:
prediction(model, vocab, "Can I pay the full USD 596 ")

'Can I pay the full USD 596  upfront'
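
`torch.max` returns only the single most likely word. To inspect the runners-up as well, the logits can be turned into probabilities and ranked. A sketch on hypothetical logits over a five-word vocabulary; `toy_words` and the logit values are made up for illustration and do not come from the trained model:

```python
import torch

# Hypothetical logits shaped (batch=1, vocab_size=5), as the model would return
toy_words = ['<unk>', 'upfront', 'quarterly', 'the', '?']
logits = torch.tensor([[0.1, 3.0, 1.5, 0.2, 2.0]])

# Softmax over the vocabulary dimension, then keep the three largest entries
probabilities = torch.softmax(logits, dim=1)
top_probs, top_indices = torch.topk(probabilities, k=3, dim=1)

for prob, index in zip(top_probs[0], top_indices[0]):
    print(f"{toy_words[index.item()]}: {prob.item():.3f}")
```

With these logits the ranking is `upfront`, `?`, `quarterly`. The same two calls could be applied to the `output` tensor inside `prediction`.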

In [45]:
import time

num_tokens = 10
input_text = "While your"

for i in range(num_tokens):
  output_text = prediction(model, vocab, input_text)
  print(output_text)
  input_text = output_text
  time.sleep(0.5)

While your quarterly
While your quarterly license
While your quarterly license is
While your quarterly license is active
While your quarterly license is active you
While your quarterly license is active you have
While your quarterly license is active you have unlimited
While your quarterly license is active you have unlimited replays
While your quarterly license is active you have unlimited replays .
While your quarterly license is active you have unlimited replays . after


In [47]:
# Function to calculate accuracy
def calculate_accuracy(model, dataloader, device):
    model.eval()  # Set the model to evaluation mode
    correct = 0
    total = 0

    with torch.no_grad():  # No need to compute gradients
        for batch_x, batch_y in dataloader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)

            # Get model predictions
            outputs = model(batch_x)

            # Get the predicted word indices
            _, predicted = torch.max(outputs, dim=1)

            # Compare with actual labels
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)

    accuracy = correct / total * 100
    return accuracy

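# Compute accuracy on the training data (the same dataloader used for training),
# so this measures how well the model memorized the text, not how well it generalizes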
accuracy = calculate_accuracy(model, dataloader, device)
print(f"Model Accuracy: {accuracy:.2f}%")


Model Accuracy: 95.38%
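
The figure above is measured on the same samples the model was trained on. One way to get a held-out estimate would be to set aside part of the samples before training. A minimal sketch of such an index split, under the assumption that a random 80/20 split is acceptable; `split_indices`, `holdout_fraction`, and `seed` are hypothetical names:

```python
import random

def split_indices(num_samples, holdout_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, holdout) lists."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_holdout = int(num_samples * holdout_fraction)
    return indices[num_holdout:], indices[:num_holdout]

train_idx, val_idx = split_indices(606)
print(len(train_idx), len(val_idx))
# → 485 121
```

The two index lists could be passed to `torch.utils.data.Subset(dataset, train_idx)` and `Subset(dataset, val_idx)` to build separate dataloaders. Since the samples are overlapping prefixes of the same sentences, a random split still leaks text between the two sets; splitting by sentence before building the prefixes would be stricter.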
