A multiclass sentiment analysis model built entirely with NumPy, no PyTorch, no TensorFlow. This project implements tokenization, word embeddings, a forward pass, backpropagation, and evaluation from the ground up.
Built as a learning exercise to understand what happens under the hood of modern ML frameworks.
Text → Tokenization → Word Embeddings → Mean Pooling → Linear Layer → Softmax → Prediction
- Raw text is cleaned and tokenized into vocabulary indices
- Each word is mapped to a 10-dimensional embedding vector, initialized randomly
- Sentence representation is computed by averaging all word embeddings
- A linear classifier maps the sentence vector to class probabilities via softmax
- Weights, biases, and embeddings are updated each epoch through gradient descent and backpropagation
- Training stops when parameter updates fall below a set tolerance
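The steps above can be sketched in plain NumPy. This is a simplified illustration rather than the project's actual code: the sizes, variable names (E, W, b), and learning rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 50-word vocabulary, 10-dim embeddings, 3 classes
vocab_size, embed_dim, num_classes = 50, 10, 3
E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # embedding table
W = rng.normal(scale=0.1, size=(embed_dim, num_classes))  # linear weights
b = np.zeros(num_classes)                                 # bias

def forward(token_ids):
    """Embed -> mean pool -> linear -> softmax."""
    sentence = E[token_ids].mean(axis=0)      # (embed_dim,)
    logits = sentence @ W + b
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return sentence, exp / exp.sum()

def train_step(token_ids, label, lr=0.5):
    """One gradient-descent update for softmax cross-entropy loss."""
    global W, b
    sentence, probs = forward(token_ids)
    grad_logits = probs.copy()
    grad_logits[label] -= 1.0                 # dL/dlogits for cross-entropy
    grad_sentence = W @ grad_logits           # backprop through the linear layer
    W -= lr * np.outer(sentence, grad_logits)
    b -= lr * grad_logits
    # Mean pooling spreads the gradient evenly over the tokens;
    # np.subtract.at accumulates correctly for repeated token ids.
    np.subtract.at(E, token_ids, lr * grad_sentence / len(token_ids))
    return -np.log(probs[label])              # cross-entropy loss
```

Calling train_step repeatedly on the same example should drive its loss toward zero; the project's Training() loop additionally applies the tolerance-based stopping criterion described above.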
numpy
pandas
Install with:
pip install numpy pandas
Place your dataset as sentiment_analysis.csv in the project root.
The file should have two columns: text and sentiment.
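A minimal loading-and-tokenization sketch follows. The regex tokenizer and vocabulary scheme here are assumptions for illustration, not necessarily what sentiment_classification.py does, and a tiny inline DataFrame stands in for the CSV:

```python
import re
import pandas as pd

# In the project you would load the real file instead:
# df = pd.read_csv("sentiment_analysis.csv")
df = pd.DataFrame({
    "text": ["I loved this movie", "Terrible, boring plot"],
    "sentiment": ["positive", "negative"],
})

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z']+", text.lower())

# Assign each distinct word a vocabulary index
vocab = {}
for text in df["text"]:
    for word in tokenize(text):
        vocab.setdefault(word, len(vocab))

# Convert every sentence to a list of vocabulary indices
token_ids = [[vocab[w] for w in tokenize(t)] for t in df["text"]]
```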
# Train
check = Model(mode="train", learning_step=1, tolerance=1e-3)
check.Training()
check.Evaluate(check.weights, check.bias, check.embedding)
# Evaluate on test set
check_test = Model(mode="test")
check_test.Evaluate(check.weights, check.bias, check.embedding)
Run with:
python sentiment_classification.py
| Split | Accuracy |
|---|---|
| Training | 94.4% |
| Test | 22.0% |
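Accuracy here is the fraction of exact class matches between predictions and labels. A sketch (the arrays are made up for illustration):

```python
import numpy as np

preds = np.array([0, 2, 1, 1, 0])   # predicted class ids (argmax of softmax)
labels = np.array([0, 1, 1, 2, 0])  # ground-truth class ids
accuracy = (preds == labels).mean()
print(f"{accuracy:.1%}")            # 60.0%
```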
The gap between training and test accuracy is expected given the simplicity of the architecture: mean pooling discards word order, and a 10-dimensional embedding has limited representational capacity. This is discussed in detail in the accompanying blog post.
Full walkthrough: Building a Sentiment Analysis Model from Scratch
- No sequence modeling — word order is lost through mean pooling
- Small embedding dimension (10) limits representational capacity
- Linear classifier only — no hidden layers
- No out-of-vocabulary handling