In [1]:
!pip install nltk



In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
from torch.utils.data import Dataset, DataLoader
from nltk.tokenize import word_tokenize
import nltk

In [3]:
document = """About the Program
This program follows a monthly subscription model at Rs 799 per month.
The total duration of the DSMP is seven months.
The approximate overall fee across seven months is Rs 5600.
You can join anytime without waiting for a new batch to start.
All live sessions are recorded for later viewing.
Session recordings appear in your dashboard after class.
Typical live sessions run for about two hours.
The teaching language used in class is Hinglish.
Non-technical learners are welcome to enroll and start from basics.
Joining mid-way still grants access to past recordings during validity.

Syllabus Overview
The syllabus covers Python Fundamentals for complete beginners.
You will learn Python libraries used in Data Science work.
Data Analysis modules focus on EDA and storytelling with data.
SQL for Data Science includes joins and window functions.
Maths for Machine Learning keeps intuition first and formulas second.
ML Algorithms include supervised and unsupervised techniques.
Practical ML focuses on pipelines, validation, and feature engineering.
MLOps introduces experiment tracking, packaging, and simple CI/CD.
Case studies simulate realistic business problems and decisions.
Deep Learning and NLP are not part of this curriculum.

Links and Resources
You can check detailed syllabus on the official course page.
The monthly timetable is shared through a public Google Sheet.
Official payments are made only through the course website.
Reminder emails are sent before each paid session.
Dashboard access shows recordings, notes, and the doubt form.

Live Sessions
If you miss a session, you can watch the recording later.
Most sessions last roughly one hundred and twenty minutes.
Slides are primarily in English with Hinglish explanations.
Q&A time is reserved at the end of the session.
Weekly schedules are announced on the shared sheet.

Access and Validity
Your subscription is valid for thirty days from purchase time.
Renewals extend your access window by another thirty days.
During an active cycle, past paid content remains unlocked.
If your validity lapses, you will need to renew to continue.
Joining on any date shifts your next renewal to that date.

Refund Policy
A seven-day refund window starts from your payment date.
Refunds requested after seven days are not eligible.
Plan your first week to evaluate fit before the window closes.
Contact support if you need help initiating a refund request.
Refunds apply to the latest payment within policy limits.

Payments
Monthly payments must be made on the official website.
Do not pay through third-party links or private messages.
Receipts are emailed after successful payment processing.
International payment issues can be escalated by email.
Include your registered email and phone number when writing support.

Eligibility and Onboarding
Beginners from non-tech backgrounds can join confidently.
The course starts from Python basics and builds gradually.
You can jump in mid-month and begin with recordings.
Your dashboard unlocks immediately after successful payment.
Orientation notes help you navigate the platform quickly.

Doubt Support
You can fill a doubt form through the dashboard.
The team schedules one-on-one clarity calls for complex issues.
Past-week doubts can still be raised using the form option.
Provide examples or screenshots to speed up resolution.
Response times are communicated after form submission.

Certificate Criteria
You must complete full fee payment across seven months.
You must attempt all course assessments to qualify.
Certificates are issued to learners who meet both criteria.
Assessments emphasize applied understanding over rote math.
Keep submission notes concise and clearly structured.

Placement Assistance
Placement assistance does not imply a placement guarantee.
Job offers or interviews are not guaranteed by the program.
Assistance includes portfolio building and resume guidance.
Soft-skill sessions improve communication and interviews.
Mentor sessions add real-world perspective and feedback.
Job hunting strategies cover ATS keywords and outreach.
You should expect guidance, not assured outcomes.

Content Emphasis
Python Fundamentals cover syntax, control flow, and functions.
Libraries for DS include NumPy, Pandas, and Matplotlib.
Data Analysis focuses on tidy data and reproducible EDA.
SQL practice builds confidence with joins and aggregates.
Maths for ML develops intuition for vectors and gradients.
Algorithms include linear models, trees, and clustering.
Practical ML stresses pipelines and validation rigor.
MLOps introduces tracking, packaging, and deployments.
Case studies connect methods to business decisions.

Recordings and Dashboard
Recordings are your safety net for missed classes.
Videos appear in the dashboard within the validity period.
Downloadable resources are attached where permitted.
You can rewatch tricky segments at your own pace.
Keep personal notes aligned to each module outcome.

Scheduling
Typical classes run in the evening IST schedule.
Exact start times are posted on the weekly sheet.
Reminder emails arrive before each live session.
Calendar links may be provided for convenience.
Check the sheet regularly for any timing updates.

Policies and Safety
Always pay only through the official website link.
Never share OTPs or passwords with anyone.
Support will never ask for your confidential info.
Use the listed emails for payment-related queries.
Verify URLs before completing a transaction.

International Learners
If cards fail, contact support for alternatives.
Share transaction error screenshots for faster help.
Confirm time zone differences for live sessions.
Recordings help when time zones are challenging.
Support provides guidance tailored to your region.

Learning Approach
Focus on clarity and reproducibility over complexity.
Start with baseline models before heavy tuning.
Use proper validation to avoid leakage pitfalls.
Document assumptions at the top of each notebook.
Prefer readable code and clear variable names.

Evaluation Style
Assessments are short and focused on application.
Rubrics reward reasoning and decision justification.
Error analysis is valued alongside metric scores.
Write concise summaries of your modeling choices.
Link metrics to practical business costs where possible.

Practical Tips
Pin library versions to stabilize your environment.
Seed randomness to reproduce key results consistently.
Keep datasets versioned as experiments progress.
Use checklists to reduce last-minute mistakes.
Save artifacts with consistent naming conventions.

SQL Module Highlights
Practice joins across fact and dimension tables.
Use window functions for rankings and rolling stats.
Write groupby summaries at meaningful aggregation levels.
Consider indexes and query plans for performance.
Write clear SQL with consistent formatting.

Pandas and Visualization
Indexing and selection patterns improve readability.
Groupby pipelines summarize behavior effectively.
Avoid chained operations when clarity suffers.
Label axes and titles for meaningful charts.
Choose appropriate scales for honest visuals.

Maths Essentials
Vectors and matrices are introduced with intuition.
Gradients are linked to simple geometric ideas.
Bias-variance tradeoff is explained with examples.
Probability basics support reasoning under uncertainty.
You learn enough math to use models responsibly.

ML Algorithms
Begin with linear and logistic regression baselines.
Move to trees and ensembles for nonlinear structure.
Try clustering for unsupervised pattern discovery.
Use cross-validation to compare model families.
Tune hyperparameters only after strong baselines.

MLOps Basics
Track experiments with simple run identifiers.
Package code for predictable training and inference.
Capture environment details for reproducibility.
Automate small checks in a lightweight CI step.
Log decisions and metrics for future audits.

Case Studies
Start with a crisp problem statement and success metric.
Explore data visually to form testable hypotheses.
Engineer features grounded in domain intuition.
Validate with appropriate temporal splits when needed.
Present tradeoffs and a pragmatic recommendation.

Communication
Use plain language in reports and presentations.
Prefer few clear plots over many noisy ones.
Explain why a metric was chosen for the task.
State assumptions and limitations openly.
Outline next steps with realistic timelines.

Enrollment Flexibility
Mid-month joins are supported by rolling validity.
Renewals occur thirty days after your payment date.
Access continues until the current cycle ends.
Rejoining later restores your dashboard promptly.
You can learn at a pace that fits your schedule.

Support Channels
Email support for payment or access issues.
Include registered email and phone in messages.
Attach relevant screenshots for context.
Expect confirmation and estimated response windows.
Escalations are available for unresolved cases.

What’s Not Included
Deep Learning is outside the current scope.
NLP topics are not covered in this program.
Placement guarantees are not offered by the team.
Lifetime access is not provided due to low fees.
Advanced research topics are out of scope here.

After Course Access
After completing seven payments you keep access until the stated end date.
Final access windows are communicated near course completion.
Policies may update; refer to official pages for changes.
Timelines align with the published DSMP cohort information.
Use the dashboard to check your current validity dates.

Study Habits
Block time after each class to review notes.
Rewatch complex parts of recordings at higher speed.
Practice SQL and Pandas daily for fluency.
Summarize each module in your own words.
Share doubts early through the form.

Quality and Integrity
Cite data sources when using external datasets.
Avoid leaking information across data splits.
Prefer interpretable baselines before complex stacks.
Use calibration where thresholds matter to outcomes.
Keep feedback loops open with mentors and peers.

Community and Mentorship
Portfolio sessions cover project curation and impact framing.
Soft-skill exercises include STAR answers and mock interviews.
Mentor talks reveal real constraints from industry work.
Networking tips center on targeted outreach and follow-ups.
Job strategies emphasize ATS alignment to job descriptions.

Admin Reminders
All payments are processed on the official website.
Refunds are handled only within the seven-day window.
Schedules are updated in the shared Google Sheet.
Official updates are sent via registered email.
Keep your profile details accurate in the dashboard.

Contact
For payments and access, write to nitish.campusx@gmail.com.
Use clear subject lines describing your issue briefly.
Include your registered email and transaction details.
Do not share sensitive information in emails.
Expect a response with next steps and timelines.

Wrapping Up
The DSMP focuses on solid DS foundations and practical ML.
You will learn Python, SQL, Maths, and core ML methods.
MLOps adds tracking and simple deployment discipline.
Case studies tie methods to decisions stakeholders care about.
Recordings, rolling validity, and guidance keep learning flexible."""


In [4]:
# download the NLTK tokenizer data that word_tokenize relies on
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [5]:
# tokenize
tokens = word_tokenize(document.lower())

In [6]:
# build vocab
vocab = {'<unk>':0}

for token in Counter(tokens).keys():
  if token not in vocab:
    vocab[token] = len(vocab)

vocab

{'<unk>': 0,
 'about': 1,
 'the': 2,
 'program': 3,
 'this': 4,
 'follows': 5,
 'a': 6,
 'monthly': 7,
 'subscription': 8,
 'model': 9,
 'at': 10,
 'rs': 11,
 '799': 12,
 'per': 13,
 'month': 14,
 '.': 15,
 'total': 16,
 'duration': 17,
 'of': 18,
 'dsmp': 19,
 'is': 20,
 'seven': 21,
 'months': 22,
 'approximate': 23,
 'overall': 24,
 'fee': 25,
 'across': 26,
 '5600.': 27,
 'you': 28,
 'can': 29,
 'join': 30,
 'anytime': 31,
 'without': 32,
 'waiting': 33,
 'for': 34,
 'new': 35,
 'batch': 36,
 'to': 37,
 'start': 38,
 'all': 39,
 'live': 40,
 'sessions': 41,
 'are': 42,
 'recorded': 43,
 'later': 44,
 'viewing': 45,
 'session': 46,
 'recordings': 47,
 'appear': 48,
 'in': 49,
 'your': 50,
 'dashboard': 51,
 'after': 52,
 'class': 53,
 'typical': 54,
 'run': 55,
 'two': 56,
 'hours': 57,
 'teaching': 58,
 'language': 59,
 'used': 60,
 'hinglish': 61,
 'non-technical': 62,
 'learners': 63,
 'welcome': 64,
 'enroll': 65,
 'and': 66,
 'from': 67,
 'basics': 68,
 'joining': 69,
 'mid-way

In [7]:
len(vocab)

757
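
In the cells above, index 0 does double duty: it is the id of `<unk>` and it is also the value used later for padding. A minimal sketch of an alternative, assuming you want the two kept apart, reserves a separate `<pad>` entry. The names `build_vocab`, `specials`, and `toy_tokens` are illustrative and not part of the notebook.

```python
from collections import Counter

def build_vocab(tokens, specials=("<pad>", "<unk>")):
    """Assign ids in first-seen order, after the reserved special tokens."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok in Counter(tokens):  # Counter keeps first-seen order
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

# Hypothetical toy input, not taken from the notebook's document
toy_tokens = "the course starts from python basics and the course builds".split()
toy_vocab = build_vocab(toy_tokens)
print(toy_vocab)
```

With this layout, padding positions and unknown words no longer share an embedding row. The rest of the notebook would then need to pad with `toy_vocab["<pad>"]` instead of a literal 0.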

In [8]:
# one entry per line of the document; section headings and blank lines are kept
input_sentences = document.split('\n')

In [9]:
def text_to_indices(sentence, vocab):

  numerical_sentence = []

  for token in sentence:
    if token in vocab:
      numerical_sentence.append(vocab[token])
    else:
      numerical_sentence.append(vocab['<unk>'])

  return numerical_sentence
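
A quick way to see the fallback behaviour is to run the same lookup on a toy vocabulary. This is a compact sketch of the same idea using `dict.get`. The names `to_indices` and `toy_vocab` are made up for the illustration.

```python
def to_indices(tokens, vocab, unk_token="<unk>"):
    """Map tokens to ids; anything missing from vocab gets the <unk> id."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]

# Hypothetical toy vocabulary, much smaller than the notebook's
toy_vocab = {"<unk>": 0, "the": 1, "course": 2, "is": 3}
ids = to_indices(["the", "course", "is", "free"], toy_vocab)
print(ids)  # [1, 2, 3, 0]  ("free" is out of vocabulary)
```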


In [10]:
input_numerical_sentences = []

for sentence in input_sentences:
  input_numerical_sentences.append(text_to_indices(word_tokenize(sentence.lower()), vocab))


In [11]:
len(input_numerical_sentences)

267

In [12]:
training_sequence = []
for sentence in input_numerical_sentences:

  # every prefix of length >= 2 becomes one example: context tokens + next token
  for i in range(1, len(sentence)):
    training_sequence.append(sentence[:i+1])

In [13]:
len(training_sequence)

1610

In [14]:
training_sequence[:5]

[[1, 2], [1, 2, 3], [4, 3], [4, 3, 5], [4, 3, 5, 6]]
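
The prefix expansion can be checked in isolation on a single short sentence. The sketch below assumes the same rule as the loop above (every prefix of length two or more); `prefix_sequences` and `toy_sentence` are illustrative names.

```python
def prefix_sequences(ids):
    """Return all prefixes of length >= 2; the last id of each is the target."""
    return [ids[: i + 1] for i in range(1, len(ids))]

toy_sentence = [4, 3, 5, 6]
prefixes = prefix_sequences(toy_sentence)
print(prefixes)  # [[4, 3], [4, 3, 5], [4, 3, 5, 6]]
```

A sentence of n tokens therefore yields n - 1 examples, and empty or single-token lines contribute none.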

In [15]:
len_list = []

for sequence in training_sequence:
  len_list.append(len(sequence))

max(len_list)

14

In [16]:
training_sequence[0]

[1, 2]

In [17]:
padded_training_sequence = []
for sequence in training_sequence:

  padded_training_sequence.append([0]*(max(len_list) - len(sequence)) + sequence)

In [18]:
len(padded_training_sequence[10])

14
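
The padding step is easier to follow on a handful of short sequences. This sketch assumes left padding with 0, as in the cell above. Padding on the left keeps the real tokens at the end of the sequence, which is where the LSTM's final hidden state is read. `left_pad` and `toy_sequences` are hypothetical names.

```python
def left_pad(seq, max_len, pad_id=0):
    """Prepend pad_id until the sequence reaches max_len."""
    return [pad_id] * (max_len - len(seq)) + seq

toy_sequences = [[1, 2], [4, 3, 5], [4, 3, 5, 6]]
max_len = max(len(s) for s in toy_sequences)
padded = [left_pad(s, max_len) for s in toy_sequences]
print(padded)  # [[0, 0, 1, 2], [0, 4, 3, 5], [4, 3, 5, 6]]
```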

In [19]:
padded_training_sequence = torch.tensor(padded_training_sequence, dtype=torch.long)

In [20]:
padded_training_sequence

tensor([[  0,   0,   0,  ...,   0,   1,   2],
        [  0,   0,   0,  ...,   1,   2,   3],
        [  0,   0,   0,  ...,   0,   4,   3],
        ...,
        [  0,   0,   0,  ..., 329, 313, 104],
        [  0,   0,   0,  ..., 313, 104, 756],
        [  0,   0,   0,  ..., 104, 756,  15]])

In [21]:
X = padded_training_sequence[:, :-1]
y = padded_training_sequence[:,-1]

In [None]:
X

tensor([[  0,   0,   0,  ...,   0,   0,   1],
        [  0,   0,   0,  ...,   0,   1,   2],
        [  0,   0,   0,  ...,   0,   0,   4],
        ...,
        [  0,   0,   0,  ...,   0, 285, 176],
        [  0,   0,   0,  ..., 285, 176, 286],
        [  0,   0,   0,  ..., 176, 286, 287]])

In [22]:
y

tensor([  2,   3,   3,  ..., 104, 756,  15])
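
The same split can be written with plain lists to make the roles explicit: everything except the last column is the context, and the last column is the word to predict. The rows below are toy values, not the notebook's data.

```python
# Hypothetical padded rows (already left-padded to a common length)
padded_rows = [[0, 0, 1, 2], [0, 4, 3, 5], [4, 3, 5, 6]]

inputs = [row[:-1] for row in padded_rows]   # context tokens
targets = [row[-1] for row in padded_rows]   # next token to predict

print(inputs)   # [[0, 0, 1], [0, 4, 3], [4, 3, 5]]
print(targets)  # [2, 5, 6]
```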

In [24]:
class CustomDataset(Dataset):

  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __len__(self):
    return self.X.shape[0]

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]

In [25]:
dataset = CustomDataset(X,y)

In [26]:
len(dataset)

1610

In [27]:
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

In [28]:
class LSTMModel(nn.Module):

  def __init__(self, vocab_size):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, 100)
    self.lstm = nn.LSTM(100, 150, batch_first=True)
    self.fc = nn.Linear(150, vocab_size)

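  def forward(self, x):
    # x: (batch, seq_len) token ids -> embedded: (batch, seq_len, 100)
    embedded = self.embedding(x)
    # final_hidden_state: (1, batch, 150) for this single-layer, unidirectional LSTM
    intermediate_hidden_states, (final_hidden_state, final_cell_state) = self.lstm(embedded)
    # drop the layer dimension, then project to vocabulary logits: (batch, vocab_size)
    output = self.fc(final_hidden_state.squeeze(0))
    return output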
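
To see which tensor shapes flow through the forward pass, the three layers can be run by hand on random token ids. This is a standalone sketch that reuses the notebook's layer sizes; the batch size of 4 and the random input are assumptions made only for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim, hidden_dim = 757, 100, 150
batch, seq_len = 4, 13  # hypothetical batch; 13 matches the input width of X

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)

x = torch.randint(0, vocab_size, (batch, seq_len))
embedded = embedding(x)             # (batch, seq_len, embed_dim)
all_h, (h_n, c_n) = lstm(embedded)  # all_h: (batch, seq_len, hidden_dim)
logits = fc(h_n.squeeze(0))         # (batch, vocab_size)

# For one layer and one direction, h_n is the output at the last time step
same = torch.allclose(all_h[:, -1, :], h_n[0])

print(embedded.shape, all_h.shape, h_n.shape, logits.shape, same)
```

Because the last time step is the one that feeds the classifier, left padding keeps the most recent real token adjacent to the prediction.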

In [29]:
model = LSTMModel(len(vocab))

In [30]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [31]:
model.to(device)

LSTMModel(
  (embedding): Embedding(757, 100)
  (lstm): LSTM(100, 150, batch_first=True)
  (fc): Linear(in_features=150, out_features=757, bias=True)
)

In [32]:
epochs = 50
learning_rate = 0.001

criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [33]:
# training loop

for epoch in range(epochs):
  total_loss = 0

  for batch_x, batch_y in dataloader:

    batch_x, batch_y = batch_x.to(device), batch_y.to(device)

    optimizer.zero_grad()

    output = model(batch_x)

    loss = criterion(output, batch_y)

    loss.backward()

    optimizer.step()

    total_loss = total_loss + loss.item()

  print(f"Epoch: {epoch + 1}, Loss: {total_loss:.4f}")

Epoch: 1, Loss: 322.4651
Epoch: 2, Loss: 279.0061
Epoch: 3, Loss: 264.3045
Epoch: 4, Loss: 247.8249
Epoch: 5, Loss: 228.6953
Epoch: 6, Loss: 209.7236
Epoch: 7, Loss: 190.6873
Epoch: 8, Loss: 171.8851
Epoch: 9, Loss: 153.0589
Epoch: 10, Loss: 134.8252
Epoch: 11, Loss: 118.4457
Epoch: 12, Loss: 102.3740
Epoch: 13, Loss: 88.9196
Epoch: 14, Loss: 75.6323
Epoch: 15, Loss: 64.9888
Epoch: 16, Loss: 56.0544
Epoch: 17, Loss: 46.9788
Epoch: 18, Loss: 40.8788
Epoch: 19, Loss: 35.1707
Epoch: 20, Loss: 31.0485
Epoch: 21, Loss: 26.9650
Epoch: 22, Loss: 24.1203
Epoch: 23, Loss: 21.7406
Epoch: 24, Loss: 19.4603
Epoch: 25, Loss: 17.9198
Epoch: 26, Loss: 16.5475
Epoch: 27, Loss: 15.2915
Epoch: 28, Loss: 14.2617
Epoch: 29, Loss: 13.5962
Epoch: 30, Loss: 13.0230
Epoch: 31, Loss: 12.2481
Epoch: 32, Loss: 11.6848
Epoch: 33, Loss: 11.1408
Epoch: 34, Loss: 11.0116
Epoch: 35, Loss: 10.5393
Epoch: 36, Loss: 10.1405
Epoch: 37, Loss: 9.6748
Epoch: 38, Loss: 9.6412
Epoch: 39, Loss: 9.3071
Epoch: 40, Loss: 9.4383
E
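
The printed loss is the sum of per-batch losses over an epoch, so it depends on the number of batches. A rough sketch for putting it on a per-batch scale, using the dataset size, batch size, and vocabulary size reported above, is shown here. The comparison value `log(vocab_size)` is the cross-entropy of a uniform guess over the vocabulary. Treat the result as an approximation, since the last batch is smaller than the others.

```python
import math

num_samples, batch_size, vocab_size = 1610, 32, 757
batches_per_epoch = math.ceil(num_samples / batch_size)

def mean_batch_loss(summed_loss):
    """Convert the epoch's summed loss to an average per batch."""
    return summed_loss / batches_per_epoch

uniform_baseline = math.log(vocab_size)   # loss of guessing uniformly
first_epoch = mean_batch_loss(322.4651)   # value printed for epoch 1
epoch_40 = mean_batch_loss(9.4383)        # value printed for epoch 40

print(batches_per_epoch)
print(f"uniform baseline: {uniform_baseline:.3f}")
print(f"epoch 1: {first_epoch:.3f}, epoch 40: {epoch_40:.3f}")
```

On this scale the first epoch sits slightly below the uniform baseline, which is what one would expect from a model that has barely started to learn.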

In [34]:
# prediction

def prediction(model, vocab, text):

  # tokenize
  tokenized_text = word_tokenize(text.lower())

  # text -> numerical indices
  numerical_text = text_to_indices(tokenized_text, vocab)

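  # padding: left-pad to the input length used in training
  pad_len = max(0, X.shape[1] - len(numerical_text))
  padded_text = torch.tensor([0] * pad_len + numerical_text, dtype=torch.long).unsqueeze(0)

  # send to model, on the same device as the model's weights
  padded_text = padded_text.to(next(model.parameters()).device)
  with torch.no_grad():
    output = model(padded_text)

  # predicted index
  value, index = torch.max(output, dim=1)

  # merge with text
  return text + " " + list(vocab.keys())[index.item()]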



In [35]:
prediction(model, vocab, "The course follows a monthly")

'The course follows a monthly timetable'

In [39]:
import time

num_tokens = 10
input_text = "Videos appear"

for i in range(num_tokens):
  output_text = prediction(model, vocab, input_text)
  print(output_text)
  input_text = output_text
  time.sleep(0.5)


Videos appear in
Videos appear in the
Videos appear in the dashboard
Videos appear in the dashboard within
Videos appear in the dashboard within the
Videos appear in the dashboard within the validity
Videos appear in the dashboard within the validity period
Videos appear in the dashboard within the validity period and
Videos appear in the dashboard within the validity period and for
Videos appear in the dashboard within the validity period and for payment
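
The loop above is greedy autoregressive generation: predict one word, append it, and feed the longer text back in. The sketch below isolates that control flow behind a callable so it can be run without the trained model. `generate`, `toy_next_word`, `toy_table`, and the `stop_token` option are all hypothetical, and the lookup table stands in for the model only to make the example runnable.

```python
def generate(next_word_fn, prompt, max_new_tokens=10, stop_token=None):
    """Repeatedly append the predicted next word to the running text."""
    text = prompt
    for _ in range(max_new_tokens):
        word = next_word_fn(text)
        text = text + " " + word
        if word == stop_token:
            break
    return text

# Stand-in predictor: looks only at the last word (the real model sees the whole prefix)
toy_table = {"appear": "in", "in": "the", "the": "dashboard", "dashboard": "."}

def toy_next_word(text):
    last = text.lower().split()[-1]
    return toy_table.get(last, ".")

result = generate(toy_next_word, "Videos appear", stop_token=".")
print(result)  # Videos appear in the dashboard .
```

A stop token is one simple way to end generation at a sentence boundary instead of always producing a fixed number of words.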


In [40]:
dataloader1 = DataLoader(dataset, batch_size=32, shuffle=False)

In [41]:
# Function to calculate accuracy
def calculate_accuracy(model, dataloader, device):
    model.eval()  # Set the model to evaluation mode
    correct = 0
    total = 0

    with torch.no_grad():  # No need to compute gradients
        for batch_x, batch_y in dataloader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)

            # Get model predictions
            outputs = model(batch_x)

            # Get the predicted word indices
            _, predicted = torch.max(outputs, dim=1)

            # Compare with actual labels
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)

    accuracy = correct / total * 100
    return accuracy

# Compute accuracy on the training data (no held-out split is used in this notebook)
accuracy = calculate_accuracy(model, dataloader1, device)
print(f"Model Accuracy: {accuracy:.2f}%")


Model Accuracy: 94.84%
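
This figure is measured on the same 1610 sequences the model was trained on, so it mostly reflects memorization of the document. A minimal sketch of how a held-out subset could be carved out by index is given below. The 10% fraction, the seed, and the name `split_indices` are assumptions, not something the notebook does.

```python
import random

def split_indices(n, holdout_fraction=0.1, seed=0):
    """Shuffle 0..n-1 reproducibly and split off a held-out slice."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_holdout = int(n * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]

train_idx, val_idx = split_indices(1610)
print(len(train_idx), len(val_idx))  # 1449 161
```

Prefixes of the same sentence can still land on both sides of such a split, so splitting by sentence before building the prefixes would be the stricter option.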
