Sentiment Classification with NumPy

A multiclass sentiment analysis model built entirely with NumPy: no PyTorch, no TensorFlow. This project implements tokenization, word embeddings, the forward pass, backpropagation, and evaluation from the ground up.

Built as a learning exercise to understand what happens under the hood of modern ML frameworks.


How it works

Text → Tokenization → Word Embeddings → Mean Pooling → Linear Layer → Softmax → Prediction

  1. Raw text is cleaned and tokenized into vocabulary indices
  2. Each word is mapped to a 10-dimensional embedding vector, initialized randomly
  3. Sentence representation is computed by averaging all word embeddings
  4. A linear classifier maps the sentence vector to class probabilities via softmax
  5. Weights, biases, and embeddings are updated each epoch through gradient descent and backpropagation
  6. Training stops when parameter updates fall below a set tolerance
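The steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the repository's actual implementation: the toy vocabulary, the `softmax` and `train_step` helpers, and cross-entropy loss are assumptions for the sketch; only the 10-dimensional embeddings, mean pooling, linear-plus-softmax classifier, and tolerance-based stopping come from the description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; the real code builds this from the dataset (step 1)
vocab = {"good": 0, "bad": 1, "movie": 2, "plot": 3}
n_classes = 3
dim = 10  # embedding dimension from the description

# Randomly initialized parameters (steps 2 and 4)
embedding = rng.normal(scale=0.1, size=(len(vocab), dim))
weights = rng.normal(scale=0.1, size=(dim, n_classes))
bias = np.zeros(n_classes)

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def train_step(tokens, label, lr=1.0):
    """One gradient-descent update on a single example (steps 3-5)."""
    global weights, bias, embedding
    idx = [vocab[t] for t in tokens if t in vocab]
    x = embedding[idx].mean(axis=0)          # step 3: mean pooling
    probs = softmax(x @ weights + bias)      # step 4: linear layer + softmax

    # Cross-entropy gradient w.r.t. the logits: probs - one_hot(label)
    dlogits = probs.copy()
    dlogits[label] -= 1.0

    dW = np.outer(x, dlogits)                # gradient w.r.t. weights
    db = dlogits                             # gradient w.r.t. bias
    dx = weights @ dlogits                   # gradient w.r.t. sentence vector
    # Mean pooling spreads dx equally over the contributing embeddings
    for i in idx:
        embedding[i] -= lr * dx / len(idx)
    weights -= lr * dW
    bias -= lr * db
    # Step 6: the caller stops once this falls below the tolerance
    return max(np.abs(lr * dW).max(), np.abs(lr * db).max())

delta = train_step(["good", "movie"], label=0)
print(f"largest parameter update: {delta:.4f}")
```

Repeating `train_step` over the dataset until the returned update magnitude drops below the tolerance reproduces the stopping rule in step 6.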

Requirements

numpy
pandas

Install with:

```
pip install numpy pandas
```

Usage

Place your dataset as `sentiment_analysis.csv` in the project root. The file must have two columns: `text` and `sentiment`.
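A quick way to confirm the file matches the expected layout (a hedged sketch; the sample rows and label names here are illustrative only, and only the filename and the `text`/`sentiment` columns come from the description above):

```python
import pandas as pd

# Write a tiny example file in the expected two-column layout.
# Illustrative rows only; your real sentiment_analysis.csv replaces this.
sample = pd.DataFrame({
    "text": ["great movie", "terrible plot", "it was fine"],
    "sentiment": ["positive", "negative", "neutral"],
})
sample.to_csv("sentiment_analysis.csv", index=False)

# Sanity-check the layout before training
df = pd.read_csv("sentiment_analysis.csv")
assert {"text", "sentiment"} <= set(df.columns), "CSV needs text and sentiment columns"
print(df["sentiment"].value_counts())
```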

```python
# Train
check = Model(mode="train", learning_step=1, tolerance=1e-3)
check.Training()
check.Evaluate(check.weights, check.bias, check.embedding)

# Evaluate on test set
check_test = Model(mode="test")
check_test.Evaluate(check.weights, check.bias, check.embedding)
```

Run with:

```
python sentiment_classification.py
```

Results

| Split    | Accuracy |
|----------|----------|
| Training | 94.4%    |
| Test     | 22.0%    |

The large gap between training and test accuracy is expected given the simplicity of the architecture: mean pooling discards word order, and a 10-dimensional embedding has limited representational capacity. This is discussed in detail in the accompanying blog post.


Blog

Full walkthrough: Building a Sentiment Analysis Model from Scratch


Limitations

  • No sequence modeling: word order is lost through mean pooling
  • Small embedding dimension (10) limits representational capacity
  • Linear classifier only, with no hidden layers
  • No out-of-vocabulary handling
