
mathisdelsart/NLP-Models


NLP Techniques: From Probabilistic Models to Transformers


Three comprehensive NLP projects: Question Classification with Naive Bayes, Word Embeddings (TF-IDF, PPMI, Word2Vec), and BERT-based Question Answering.


About | Projects | Structure | Usage | Tech Stack


About

This repository demonstrates computational linguistics techniques, from classical probabilistic models to transformers, through three projects of increasing complexity:

  • Text classification using probabilistic language models
  • Semantic similarity with vector space models
  • Extractive question answering with transformer architectures

Core Focus

The projects explore fundamental and advanced NLP concepts:

  • Language Modeling: N-gram models with smoothing techniques
  • Classification: Naive Bayes with OOV handling
  • Embeddings: From sparse (TF-IDF, PPMI) to dense (Word2Vec)
  • Transformers: Fine-tuning BERT for question answering

Projects

Project 1: Question Classification

Text classification for Stack Overflow questions using probabilistic models.

Techniques:

  • Vocabulary building with frequency cutoff
  • Bigram language models (MLE, Laplace smoothing)
  • Multinomial and Binary Naive Bayes
  • OOV handling with <UNK> tokens
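The vocabulary cutoff, <UNK> mapping, and bigram estimation listed above can be sketched as follows. This is a minimal illustration and not the notebook's code: the function names, the `min_count` threshold, and the `<s>` / `</s>` boundary markers are assumptions introduced here.

```python
from collections import Counter

# Hypothetical special tokens; the notebook may use different markers.
UNK, BOS, EOS = "<UNK>", "<s>", "</s>"

def build_vocab(sentences, min_count=2):
    """Keep tokens seen at least min_count times, plus the special tokens."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    return vocab | {UNK, BOS, EOS}

def normalize(sentence, vocab):
    """Map out-of-vocabulary tokens to <UNK> and add boundary markers."""
    return [BOS] + [t if t in vocab else UNK for t in sentence] + [EOS]

def train_bigram(sentences, vocab):
    """Count contexts (unigrams) and bigrams over the normalized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = normalize(sent, vocab)
        unigrams.update(toks[:-1])            # every token that acts as a context
        bigrams.update(zip(toks, toks[1:]))   # (previous, current) pairs
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, vocab, alpha=1.0):
    """P(word | prev) with add-alpha smoothing: alpha=0 is MLE, alpha=1 is Laplace."""
    denom = unigrams[prev] + alpha * len(vocab)
    if denom == 0:
        return 0.0  # unseen context under pure MLE
    return (bigrams[(prev, word)] + alpha) / denom

sentences = [["how", "to", "sort"], ["how", "to", "parse"]]
vocab = build_vocab(sentences, min_count=2)
unigrams, bigrams = train_bigram(sentences, vocab)
print(bigram_prob("how", "to", unigrams, bigrams, vocab, alpha=0.0))  # MLE
print(bigram_prob("how", "to", unigrams, bigrams, vocab, alpha=1.0))  # Laplace
```

With Laplace smoothing every bigram over the vocabulary gets non-zero probability, so unseen word pairs in a test question do not zero out its score.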

Results: 75-80% test accuracy on 60k Stack Overflow questions
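For the Naive Bayes classifiers listed among the techniques, a compact sketch might look like the following. It assumes add-one smoothing and reads "Binary Naive Bayes" as the variant that counts each word at most once per document. The function names and the toy data are hypothetical and do not come from the notebook.

```python
import math
from collections import Counter, defaultdict

UNK = "<UNK>"  # the vocabulary passed in is assumed to contain this token

def _prepare(tokens, vocab, binary):
    """Map OOV tokens to <UNK>; in binary mode keep each word once per document."""
    mapped = [t if t in vocab else UNK for t in tokens]
    return set(mapped) if binary else mapped

def train_nb(docs, labels, vocab, binary=False):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(_prepare(tokens, vocab, binary))
    return class_counts, word_counts

def predict_nb(tokens, class_counts, word_counts, vocab, binary=False):
    """Return the class with the highest log posterior (add-one smoothing)."""
    n_docs = sum(class_counts.values())
    vocab_size = len(vocab)
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior
        for tok in _prepare(tokens, vocab, binary):
            score += math.log((word_counts[label][tok] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, invented for illustration only.
docs = [["python", "list", "error"], ["python", "loop"],
        ["java", "class", "error"], ["java", "null"]]
labels = ["py", "py", "java", "java"]
vocab = {"python", "java", "error", "list", "loop", "class", "null", UNK}
model = train_nb(docs, labels, vocab)
print(predict_nb(["python", "error", "segfault"], *model, vocab))
```

Working in log space avoids numerical underflow on long questions, and the <UNK> entry gives unseen words a smoothed probability instead of dropping them.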

Project 2: Vector Semantics

Word embeddings and semantic similarity using different vector space models.

Methods:

  • TF-IDF: Term frequency-inverse document frequency weighting
  • PPMI: Positive Pointwise Mutual Information matrices
  • Word2Vec: Skip-gram neural embeddings (150D)
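As an illustration of the TF-IDF weighting named above, here is a small pure-Python sketch. TF-IDF has several common variants. This one assumes a log-scaled term frequency, log10(count + 1), and idf = log10(N / df), which may differ from the formula used in the notebook. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = log10(count + 1), idf = log10(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: math.log10(count + 1) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]
for vec in tfidf(docs):
    print(vec)
```

A term that occurs in every document, such as "the" in this toy corpus, gets an idf of zero and therefore contributes nothing to similarity scores.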

Key Features:

  • Context window of 7 words
  • Cosine similarity for semantic matching
  • Comparison of sparse vs dense representations
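The sparse side of that comparison can be sketched with a co-occurrence count, a PPMI transform, and cosine similarity over the resulting vectors. This is an illustrative sketch, not the notebook's implementation: the function names are hypothetical, and `window` here means the number of tokens on each side of the target word, which is only one possible reading of the 7-word context window.

```python
import math
from collections import Counter, defaultdict

def cooccurrence(tokens, window=2):
    """Count how often each context word appears within `window` tokens of a target."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

def ppmi(counts):
    """Turn raw co-occurrence counts into sparse PPMI vectors (zeros are omitted)."""
    total = sum(sum(row.values()) for row in counts.values())
    row_sum = {w: sum(row.values()) for w, row in counts.items()}
    col_sum = Counter()
    for row in counts.values():
        col_sum.update(row)
    vectors = {}
    for word, row in counts.items():
        vec = {}
        for ctx, n in row.items():
            # PMI = log2( P(word, ctx) / (P(word) * P(ctx)) )
            pmi = math.log2(n * total / (row_sum[word] * col_sum[ctx]))
            if pmi > 0:
                vec[ctx] = pmi
        vectors[word] = vec
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

tokens = "the cat sat on the mat the dog sat on the rug".split()
vectors = ppmi(cooccurrence(tokens, window=2))
print(cosine(vectors["cat"], vectors["dog"]))
```

The same `cosine` function applies unchanged to dense vectors if they are stored as dicts keyed by dimension index, which keeps the sparse and dense comparison on equal footing.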

Project 3: Question Answering

BERT-based extractive QA system fine-tuned on the SQuAD dataset.

Implementation:

  • Fine-tuning bert-base-uncased on the Stanford Question Answering Dataset
  • Token-level span detection for answer extraction
  • GPU-accelerated training with PyTorch and Hugging Face Accelerate
  • Confidence scoring for predictions
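Token-level span detection typically works on two score vectors produced by the model, one for the answer's start position and one for its end. The sketch below shows one common way to pick a span and attach a confidence score. It is an assumption about the general technique and not the notebook's code: the logits are made-up numbers, `best_span` and `max_answer_len` are hypothetical names, and using the product of start and end probabilities as confidence is only one possible choice.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, confidence) for the best span with start <= end.

    Confidence is assumed to be P(start) * P(end) under independent softmaxes.
    """
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        # Only consider ends at or after the start, within the length limit.
        for j in range(i, min(i + max_answer_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Made-up logits for a 5-token context.
tokens = ["paris", "is", "the", "capital", "city"]
start_logits = [0.2, 0.1, 0.3, 4.0, 0.5]
end_logits = [0.1, 0.0, 0.2, 1.0, 3.5]
start, end, confidence = best_span(start_logits, end_logits)
print(tokens[start:end + 1], round(confidence, 3))
```

Restricting the search to start <= end and a maximum length rules out invalid spans that taking the two argmax positions independently could produce.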

Dataset: Stanford Question Answering Dataset (SQuAD)

Structure

NLP-Models/
├── question_classification/
│   ├── question_classification.ipynb    # Project 1 notebook
│   └── corpora/
│       ├── train.csv                    # Training data
│       └── test.csv                     # Test data
├── vector_semantics/
│   ├── vector_semantics.ipynb           # Project 2 notebook
│   └── corpora/
│       ├── corpus.txt                   # Full corpus
│       └── toy_corpus.txt               # Small corpus for testing
├── question_answering/
│   ├── question_answering_1.ipynb       # QA exploration
│   └── question_answering_2.ipynb       # QA implementation
└── README.md

Usage

Quick Start

# Clone the repository
git clone https://github.com/mathisdelsart/NLP-Models.git
cd NLP-Models

Running Projects

Open the Jupyter notebooks in order:

  1. Question Classification: question_classification/question_classification.ipynb
  2. Vector Semantics: vector_semantics/vector_semantics.ipynb
  3. Question Answering: question_answering/question_answering_1.ipynb, then question_answering/question_answering_2.ipynb

Tech Stack

Core Technologies:

  • Python 3.8+
  • NLTK - Text preprocessing and tokenization
  • Gensim - Word2Vec implementation
  • Transformers - BERT and pretrained models
  • PyTorch - Deep learning framework

Data & ML:

  • NumPy/SciPy - Numerical computing
  • scikit-learn - ML utilities and metrics
  • Pandas - Data manipulation

Key Concepts

Concept           | Techniques             | Application
Language Modeling | N-grams, MLE, Laplace  | Text generation, perplexity
Classification    | Naive Bayes            | Question quality prediction
Embeddings        | TF-IDF, PPMI, Word2Vec | Semantic similarity
Transformers      | BERT fine-tuning       | Extractive QA

Author

Mathis DELSART

License

This project is developed for academic purposes as part of university coursework.


Built for LINFO2263 - Computational Linguistics @ UCLouvain (Université catholique de Louvain).

