
End-to-End Email Categorization (TF‑IDF + Word2Vec)

Technologies Used

Python, Flask, spaCy, scikit-learn, n8n, Hugging Face

Libraries

  • NLP: spaCy, Gensim (Word2Vec)
  • ML: scikit-learn (TF‑IDF, PCA, Logistic Regression)
  • Serving: Flask
  • Automation: n8n
  • Utilities: NumPy, Pandas, Joblib

Project Overview

This project is a component of a larger n8n automation workflow, showcasing how to categorize HR-related emails using modern NLP and machine learning techniques.

  • 🚀 Effortlessly manage HR emails with a fast, production-ready pipeline.
  • 📨 Automatic categorization: high_priority, low_priority, or job_applicant.
  • 🧠 Smarter text analysis using TF‑IDF + Word2Vec.
  • ⚡ Rapid, memory-efficient predictions with PCA & float16 quantization.
  • 🔗 Easy integration via a scalable Flask API & n8n workflows for automated routing.

✨ Features Overview

  • 📖 Documentation (French):
    See Chapter 5 of the report for a detailed explanation of the project, including methodology, results, and technical choices.

  • 🧠 NLP Preprocessing (spaCy):
    Tokenization, lowercasing, stop-word filtering, and optional lemmatization for clean input.

  • 🔀 Hybrid Vectorization:

    • TF‑IDF baseline for lexical signals.
    • Word2Vec for semantic signals, using mean pooling for robust document vectors.

  • 🤖 Modeling Choices:
    Compared Logistic Regression, Naive Bayes, and KNN; Logistic Regression performed best and was selected.

  • ⚡ Dimensionality Reduction:
    PCA compresses vectors from 300 → 100 dims for fast inference and compact storage.

  • 💾 Quantization:
    Vectors stored as float16 to minimize memory usage.

  • 🌐 REST API Endpoints:
    /classification and /vectorization (GET) for easy integration.

  • 🔗 Workflow Automation (n8n):
    Gmail Trigger → HTTP Request (API) → Switch & routing for seamless email management.

  • 🚀 Deployment Ready:
    Supports Hugging Face Spaces, Vercel, and traditional servers for flexible hosting.


Project Structure

Text-Categorization-TF-IDF-Word2Vec-Embeddings/
├─ Flask-API/                # Flask server (API endpoints, loading vectors/model)
├─ NLP_TF-IDF/               # TF-IDF experiments/baselines
├─ NLP_word2vec/             # Word2Vec + PCA reduction utilities and notebooks
├─ data/                     # dataset
├─ requirements.txt
├─ vercel.json               # (optional) hosting config
└─ README.md

How to Run

1. Clone the Repository

git clone https://github.com/moghit-eou/Text-Categorization-TF-IDF-Word2Vec-Embeddings.git
cd Text-Categorization-TF-IDF-Word2Vec-Embeddings

2. Create Environment & Install Deps

python3.10 -m venv .venv
# Windows: py -3.10 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

3. Prepare Word2Vec Vectors

  1. Download GoogleNews Word2Vec (300‑dim, ~3.5GB): get the original GoogleNews-vectors-negative300.bin from Hugging Face.

  2. Reduce dimensionality & quantize: apply PCA to compress the vectors from 300 → 100 dimensions, then save them as float16 keyed vectors for efficient storage.

PCA & Quantization
Why PCA & Quantization? ▼
  • PCA shrinks the original 300-dim Word2Vec vectors to 100 dimensions, reducing size from ~3.5GB to just ~600MB, speeding up inference, and making deployment practical for cloud/serverless.
  • Quantization (float16) further cuts memory footprint.
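The reduce-and-quantize step can be sketched as follows. The random matrix below is only a stand-in for the real GoogleNews embedding matrix; the PCA and float16 casting mirror the steps described above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the GoogleNews matrix: random 300-dim vectors for 1000 words.
rng = np.random.default_rng(0)
vectors_300d = rng.normal(size=(1000, 300)).astype(np.float32)

# PCA: 300 -> 100 dimensions.
pca = PCA(n_components=100)
vectors_100d = pca.fit_transform(vectors_300d)

# Quantize to float16 for compact storage.
vectors_100d_f16 = vectors_100d.astype(np.float16)

print(vectors_100d_f16.shape, vectors_100d_f16.dtype)  # (1000, 100) float16
ratio = vectors_300d.nbytes / vectors_100d_f16.nbytes
print(f"~{ratio:.0f}x smaller")  # float32 x 300 dims -> float16 x 100 dims = 6x
```

In the real pipeline the reduced matrix would be saved as keyed vectors (word → 100-dim float16 row) so the Flask app can load it at startup.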

PCA Comparison: 300D vs 100D

Visualization:
The image above shows a 2D projection of document vectors for the three categories before and after PCA reduction (300D → 100D).
As seen, the separation between classes remains clear, indicating that dimensionality reduction does not significantly impact model accuracy.


  4. Transform Text & Classify
  • Vectorize Email:
    Convert the email body into a vector using mean-pooled Word2Vec embeddings:
    Email Vectorization Example
Is mean-pooling applied correctly? ▼

After representing each email with a single vector (via mean-pooling of Word2Vec embeddings),
I can visualize these document vectors in 2D space using PCA.

As shown below, the clusters demonstrate that it’s possible to separate and classify emails into their categories.

Mean Pooling PCA Visualization
  • Classify with ML Model:
    Train a Logistic Regression classifier to predict the category (high_priority, low_priority, or job_applicant).

    | Model               | Accuracy | Macro‑F1 |
    |---------------------|----------|----------|
    | Naive Bayes         | 0.83     | 0.82     |
    | KNN                 | 0.61     | 0.85     |
    | Logistic Regression | 0.90     | 0.89     |

    Metrics are indicative; expect variation with dataset size, class balance, preprocessing, and PCA dimension.
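A comparison like the one above can be reproduced in miniature on synthetic data (not the project's email dataset). GaussianNB stands in for the Naive Bayes variant here because the document vectors are dense; scores on toy data will of course differ from the table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the 100-dim document vectors.
X, y = make_classification(n_samples=600, n_features=100, n_informative=20,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, preds):.2f} "
          f"macro-F1={f1_score(y_te, preds, average='macro'):.2f}")
```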

Confusion Matrix of the Classifier ▼

The confusion matrix shows how well the classifier distinguishes between
the three categories: High Priority, Low Priority, and Job Applicant.

It helps to identify where misclassifications occur (e.g., borderline cases between high and low priority).

Confusion Matrix

Deploy on Hugging Face Spaces using Docker

Note: Only Hugging Face Spaces and AWS EC2 worked for me. Free platforms such as Vercel only handle small models; for larger models/vectors you need a paid plan.

  1. Ready-to-use Demo
    Access the live app: Hugging Face Space

Note: The Space may be in sleep mode; allow ~20s for startup.

  2. Dockerized Deployment
    The API is containerized for reproducible deployment.
  • Build: docker build -t email-categorizer .
  • Run: docker run -p 7860:7860 email-categorizer

  3. Automatic Startup
    Hugging Face Spaces handles container orchestration and exposes the Flask API.

  4. Direct Access
    Test endpoints via browser or curl once the Space is active.



API Endpoints

GET /classification

  • Query: email_body (string)
curl --get "URL/classification" \
     --data-urlencode "email_body=Hello HR, I’m applying for the Data Analyst role. CV attached."

Example Response

{"prediction":"job_applicant"}

GET /vectorization

  • Query: text (string)
curl --get "http://localhost:7860/vectorization" \
     --data-urlencode "text=Schedule an urgent meeting for project X today at 3pm."

Example Response

{"vector":[0.0132,-0.0479,0.0211, ... 100 dims ...]}

Document vectors are mean-pooled Word2Vec embeddings.
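For reference, a minimal Flask app exposing both endpoints might look like this. The `vectorize` and `classify` bodies are placeholders standing in for the real joblib-loaded classifier and reduced Word2Vec vectors.

```python
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

def vectorize(text: str) -> np.ndarray:
    """Placeholder: the real app mean-pools the PCA-reduced Word2Vec vectors."""
    return np.zeros(100, dtype=np.float16)

def classify(vec: np.ndarray) -> str:
    """Placeholder: the real app calls the trained Logistic Regression model."""
    return "job_applicant"

@app.route("/classification", methods=["GET"])
def classification():
    email_body = request.args.get("email_body", "")
    return jsonify({"prediction": classify(vectorize(email_body))})

@app.route("/vectorization", methods=["GET"])
def vectorization():
    text = request.args.get("text", "")
    # Cast float16 to plain floats so the vector serializes cleanly to JSON.
    return jsonify({"vector": vectorize(text).astype(float).tolist()})

# To serve locally: app.run(host="0.0.0.0", port=7860)
```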


n8n Flow


Node Hints

  • HTTP Request: GET YOUR_API_URL/classification?email_body={{$json.email_body}}
  • Switch: route on {{$json.prediction}}
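The HTTP Request node above simply issues that GET. A quick way to check the URL it should produce (the base URL below is a placeholder for your deployed endpoint):

```python
from urllib.parse import urlencode

# Hypothetical API base; substitute your own deployed Space URL.
API_URL = "https://your-space.hf.space"

email_body = "Hello HR, I'm applying for the Data Analyst role."
query = urlencode({"email_body": email_body})  # URL-encodes spaces, commas, etc.
url = f"{API_URL}/classification?{query}"
print(url)
```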

License

This project is licensed under the MIT License. You are free to use, modify, and distribute with attribution.

Happy Automating!

⭐ If you found this project useful, please give it a star!


About

Deploying large models (500MB+) via GitHub is complex due to size limits. This project leverages Hugging Face Spaces to overcome these difficulties. Live demo
