Three comprehensive NLP projects: Question Classification with Naive Bayes, Word Embeddings (TF-IDF, PPMI, Word2Vec), and BERT-based Question Answering.
This repository demonstrates modern computational linguistics techniques through three progressive projects:
- Text classification using probabilistic language models
- Semantic similarity with vector space models
- Extractive question answering with transformer architectures
The projects explore fundamental and advanced NLP concepts:
- Language Modeling: N-gram models with smoothing techniques
- Classification: Naive Bayes with OOV handling
- Embeddings: From sparse (TF-IDF, PPMI) to dense (Word2Vec)
- Transformers: Fine-tuning BERT for question answering
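As a rough illustration of the smoothing idea listed above, the following sketch estimates bigram probabilities with add-alpha smoothing. The function names, the `<s>`/`</s>` padding symbols, and the toy sentences are assumptions made for this example and are not taken from the notebooks. Setting `alpha=0` recovers the MLE estimate, and `alpha=1` is Laplace smoothing.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count bigram histories and bigrams over tokenized sentences."""
    history_counts, bigram_counts = Counter(), Counter()
    for tokens in sentences:
        # Hypothetical boundary symbols; the notebooks may pad differently.
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, curr in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, curr)] += 1
    return history_counts, bigram_counts

def bigram_prob(prev, curr, history_counts, bigram_counts, vocab_size, alpha=1.0):
    """P(curr | prev) with add-alpha smoothing (alpha=0 is plain MLE)."""
    numerator = bigram_counts[(prev, curr)] + alpha
    denominator = history_counts[prev] + alpha * vocab_size
    return numerator / denominator if denominator > 0 else 0.0

# Toy data, invented for illustration only.
toy = [["how", "to", "sort", "a", "list"], ["how", "to", "parse", "json"]]
history_counts, bigram_counts = train_bigram_counts(toy)
# Every token that can be predicted, including the end-of-sentence symbol.
vocab = {tok for sent in toy for tok in sent} | {"</s>"}

mle = bigram_prob("how", "to", history_counts, bigram_counts, len(vocab), alpha=0.0)
laplace = bigram_prob("how", "to", history_counts, bigram_counts, len(vocab))
unseen = bigram_prob("how", "json", history_counts, bigram_counts, len(vocab))
print(mle, laplace, unseen)
```

Smoothing moves some probability mass from seen bigrams to unseen ones, so a bigram that never occurred in training no longer zeroes out the probability of a whole sentence.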
Text classification for Stack Overflow questions using probabilistic models.
Techniques:
- Vocabulary building with frequency cutoff
- Bigram language models (MLE, Laplace smoothing)
- Multinomial and Binary Naive Bayes
- OOV handling with `<UNK>` tokens
Results: 75-80% test accuracy on 60k Stack Overflow questions
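The techniques above can be sketched as a small multinomial Naive Bayes classifier. This is a minimal illustration, not the notebook's code: the helper names, the `min_count` cutoff value, the labels, and the four toy questions are all hypothetical. Rare words are mapped to `<UNK>` at training time so that unseen words at test time still have a smoothed likelihood.

```python
import math
from collections import Counter, defaultdict

UNK = "<UNK>"

def build_vocab(docs, min_count=2):
    """Keep words seen at least `min_count` times; everything else becomes UNK."""
    counts = Counter(tok for tokens, _ in docs for tok in tokens)
    return {tok for tok, c in counts.items() if c >= min_count} | {UNK}

def map_oov(tokens, vocab):
    """Replace out-of-vocabulary tokens with the UNK symbol."""
    return [tok if tok in vocab else UNK for tok in tokens]

def train_nb(docs, vocab, alpha=1.0):
    """Estimate log priors and add-alpha smoothed log likelihoods per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(map_oov(tokens, vocab))
    total_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict(tokens, vocab, log_prior, log_likelihood):
    """Return the class with the highest posterior log score."""
    mapped = map_oov(tokens, vocab)
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in mapped)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training set with made-up labels, for illustration only.
train = [
    (["how", "to", "fix", "null", "pointer", "error"], "low"),
    (["fix", "my", "code", "error", "please"], "low"),
    (["why", "does", "hashing", "improve", "lookup", "complexity"], "high"),
    (["why", "does", "sorting", "affect", "lookup", "complexity"], "high"),
]
vocab = build_vocab(train)
log_prior, log_likelihood = train_nb(train, vocab)
print(predict(["please", "fix", "this", "error"], vocab, log_prior, log_likelihood))
print(predict(["why", "is", "lookup", "slow"], vocab, log_prior, log_likelihood))
```

A binary variant would count each word at most once per document (for example by passing `set(tokens)` into the counting step) while keeping the rest unchanged.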
Word embeddings and semantic similarity using different vector space models.
Methods:
- TF-IDF: Term frequency-inverse document frequency weighting
- PPMI: Positive Pointwise Mutual Information matrices
- Word2Vec: Skip-gram neural embeddings (150D)
Key Features:
- Context window of 7 words
- Cosine similarity for semantic matching
- Comparison of sparse vs dense representations
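To make the sparse side of that comparison concrete, here is a small sketch that builds PPMI vectors from windowed co-occurrence counts and compares two words with cosine similarity. The function names and the toy sentence are invented for this example. The `window` parameter counts words on each side of the target, which is one possible reading of "context window of 7"; the sketch uses 2 to suit the tiny corpus.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (word, context) pairs within +/- `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

def ppmi_vectors(counts):
    """Turn co-occurrence counts into sparse PPMI vectors (dict of dicts)."""
    total = sum(counts.values())
    word_totals, context_totals = Counter(), Counter()
    for (w, c), n in counts.items():
        word_totals[w] += n
        context_totals[c] += n
    vectors = {}
    for (w, c), n in counts.items():
        # PMI = log2( P(w, c) / (P(w) * P(c)) ), rewritten with raw counts.
        pmi = math.log2(n * total / (word_totals[w] * context_totals[c]))
        if pmi > 0:  # the "positive" part of PPMI: drop zero and negative values
            vectors.setdefault(w, {})[c] = pmi
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(val * v.get(key, 0.0) for key, val in u.items())
    norm_u = math.sqrt(sum(val * val for val in u.values()))
    norm_v = math.sqrt(sum(val * val for val in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus, for illustration only.
tokens = "the cat sat on the mat the dog sat on the rug".split()
counts = cooccurrence_counts(tokens, window=2)
vectors = ppmi_vectors(counts)
similarity = cosine(vectors.get("cat", {}), vectors.get("dog", {}))
print(round(similarity, 3))
```

Each PPMI vector has one dimension per context word and is mostly zeros, whereas a Word2Vec embedding packs the same kind of distributional information into a fixed number of dense dimensions.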
BERT-based extractive QA system fine-tuned on the SQuAD dataset.
Implementation:
- Fine-tuning `bert-base-uncased` on Stanford Question Answering Dataset
- Token-level span detection for answer extraction
- GPU-accelerated training with PyTorch and Hugging Face Accelerate
- Confidence scoring for predictions
Dataset: Stanford Question Answering Dataset (SQuAD)
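Span detection and confidence scoring can be illustrated without loading a model. The sketch below assumes the QA head has already produced one start logit and one end logit per token, then picks the highest-scoring valid span. The function names, the `max_answer_len` limit, the logit values, and the choice of confidence as the product of start and end probabilities are assumptions for this example; the notebooks may score spans differently.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, confidence) for the best span with start <= end."""
    start_probs, end_probs = softmax(start_logits), softmax(end_logits)
    best = (0, 0, 0.0)
    for i, p_start in enumerate(start_probs):
        # Only consider ends at or after the start, within the length limit.
        for j in range(i, min(i + max_answer_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Hand-written tokens and logits standing in for real model output.
tokens = ["[CLS]", "who", "wrote", "it", "[SEP]",
          "ada", "lovelace", "wrote", "it", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 0.0, 0.0, 6.0, 1.0, 0.0, 0.0, 0.0]
end_logits = [0.1, 0.0, 0.0, 0.0, 0.0, 1.0, 6.0, 0.0, 0.0, 0.0]

start, end, confidence = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer, round(confidence, 3))
```

A fuller pipeline would typically also restrict candidate spans to context tokens, so that the question itself or special tokens cannot be returned as the answer.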
NLP-Models/
├── question_classification/
│   ├── question_classification.ipynb   # Project 1 notebook
│   └── corpora/
│       ├── train.csv                   # Training data
│       └── test.csv                    # Test data
├── vector_semantics/
│   ├── vector_semantics.ipynb          # Project 2 notebook
│   └── corpora/
│       ├── corpus.txt                  # Full corpus
│       └── toy_corpus.txt              # Small corpus for testing
├── question_answering/
│   ├── question_answering_1.ipynb      # QA exploration
│   └── question_answering_2.ipynb      # QA implementation
└── README.md
# Clone the repository
git clone https://github.com/mathisdelsart/NLP-Models.git
cd NLP-Models

Open the Jupyter notebooks in order:
- Question Classification: `question_classification/question_classification.ipynb`
- Vector Semantics: `vector_semantics/vector_semantics.ipynb`
- Question Answering: `question_answering/question_answering_1.ipynb` → `question_answering/question_answering_2.ipynb`
Core Technologies:
- Python 3.8+
- NLTK - Text preprocessing and tokenization
- Gensim - Word2Vec implementation
- Transformers - BERT and pretrained models
- PyTorch - Deep learning framework
Data & ML:
- NumPy/SciPy - Numerical computing
- scikit-learn - ML utilities and metrics
- Pandas - Data manipulation
| Concept | Techniques | Application |
|---|---|---|
| Language Modeling | N-grams, MLE, Laplace | Text generation, perplexity |
| Classification | Naive Bayes | Question quality prediction |
| Embeddings | TF-IDF, PPMI, Word2Vec | Semantic similarity |
| Transformers | BERT fine-tuning | Extractive QA |
This project is developed for academic purposes as part of university coursework.
Built for LINFO2263 - Computational Linguistics @ UCLouvain (Université catholique de Louvain).