
End-to-End Email Categorization (TF‑IDF + Word2Vec)

Technologies Used

Python, Flask, spaCy, scikit-learn, n8n, Hugging Face

Libraries

  • NLP: spaCy, Gensim (Word2Vec)
  • ML: scikit-learn (TF‑IDF, PCA, Logistic Regression)
  • Serving: Flask
  • Automation: n8n
  • Utilities: NumPy, Pandas, Joblib

Project Overview

This project is a component of a larger n8n automation workflow, showcasing how to categorize HR-related emails using modern NLP and machine learning techniques.

  • 🚀 Effortlessly manage HR emails with a fast, production-ready pipeline.
  • 📨 Automatic categorization: high_priority, low_priority, or job_applicant.
  • 🧠 Smarter text analysis using TF‑IDF + Word2Vec.
  • ⚡ Rapid, memory-efficient predictions with PCA & float16 quantization.
  • 🔗 Easy integration via a scalable Flask API & n8n workflows for automated routing.

✨ Features Overview

  • 📖 Documentation (French):
    See Chapter 5 of the report for a detailed explanation of the project, including methodology, results, and technical choices.

  • 🧠 NLP Preprocessing (spaCy):
    Tokenization, lowercasing, stop-word filtering, and optional lemmatization for clean input.

  • 🔀 Hybrid Vectorization:

    • TF‑IDF baseline for lexical signals.
    • Word2Vec for semantic signals, using mean pooling for robust document vectors.

  • 🤖 Modeling Choices:
    Compared Logistic Regression, Naive Bayes, and KNN; Logistic Regression performed best and was selected.

  • ⚡ Dimensionality Reduction:
    PCA compresses vectors from 300 → 100 dims for fast inference and compact storage.

  • 💾 Quantization:
    Vectors stored as float16 to minimize memory usage.

  • 🌐 REST API Endpoints:
    /classification and /vectorization (GET) for easy integration.

  • 🔗 Workflow Automation (n8n):
    Gmail Trigger → HTTP Request (API) → Switch & routing for seamless email management.

  • 🚀 Deployment Ready:
    Supports Hugging Face Spaces, Vercel, and traditional servers for flexible hosting.


Project Structure

Text-Categorization-TF-IDF-Word2Vec-Embeddings/
├─ Flask-API/                # Flask server (API endpoints, loading vectors/model)
├─ NLP_TF-IDF/               # TF-IDF experiments/baselines
├─ NLP_word2vec/             # Word2Vec + PCA reduction utilities and notebooks
├─ data/                     # dataset
├─ requirements.txt
├─ vercel.json               # (optional) hosting config
└─ README.md

How to Run

1. Clone the Repository

git clone https://github.com/moghit-eou/Text-Categorization-TF-IDF-Word2Vec-Embeddings.git
cd Text-Categorization-TF-IDF-Word2Vec-Embeddings

2. Create Environment & Install Deps

python3.10 -m venv .venv
# Windows: py -3.10 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

3. Prepare Word2Vec Vectors

  1. Download GoogleNews Word2Vec (300‑dim, ~3.5GB): get the original GoogleNews-vectors-negative300.bin from Hugging Face.

  2. Reduce dimensionality & quantize: apply PCA to compress the vectors from 300 → 100 dimensions, then save them as float16 keyed vectors for efficient storage.

PCA & Quantization
Why PCA & Quantization? ▼
  • PCA shrinks the original 300-dim Word2Vec vectors to 100 dimensions, reducing size from ~3.5GB to just ~600MB, speeding up inference, and making deployment practical for cloud/serverless.
  • Quantization (float16) further cuts memory footprint.
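The reduce-and-quantize step can be sketched as follows. The random matrix below is only a stand-in for the real GoogleNews embedding matrix; the PCA and float16 casting mirror the steps described above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the GoogleNews matrix: random 300-dim vectors for 1000 words.
rng = np.random.default_rng(0)
vectors_300d = rng.normal(size=(1000, 300)).astype(np.float32)

# PCA: 300 -> 100 dimensions.
pca = PCA(n_components=100)
vectors_100d = pca.fit_transform(vectors_300d)

# Quantize to float16 for compact storage.
vectors_100d_f16 = vectors_100d.astype(np.float16)

print(vectors_100d_f16.shape, vectors_100d_f16.dtype)  # (1000, 100) float16
ratio = vectors_300d.nbytes / vectors_100d_f16.nbytes
print(f"~{ratio:.0f}x smaller")  # float32 x 300 dims -> float16 x 100 dims = 6x
```

In the real pipeline the reduced matrix would be saved as keyed vectors (word → 100-dim float16 row) so the Flask app can load it at startup.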

PCA Comparison: 300D vs 100D

Visualization:
The image above shows a 2D projection of document vectors for the three categories before and after PCA reduction (300D → 100D).
As seen, the separation between classes remains clear, indicating that dimensionality reduction does not significantly impact model accuracy.


  4. Transform Text & Classify
  • Vectorize Email:
    Convert the email body into a vector using mean-pooled Word2Vec embeddings:
    Email Vectorization Example
Is mean-pooling applied correctly? ▼

After representing each email with a single vector (via mean-pooling of Word2Vec embeddings),
I can visualize these document vectors in 2D space using PCA.

As shown below, the clusters demonstrate that it’s possible to separate and classify emails into their categories.

Mean Pooling PCA Visualization
  • Classify with ML Model:
    Train a Logistic Regression classifier to predict the category (high_priority, low_priority, or job_applicant).

    | Model               | Accuracy | Macro‑F1 |
    |---------------------|----------|----------|
    | Naive Bayes         | 0.83     | 0.82     |
    | KNN                 | 0.61     | 0.85     |
    | Logistic Regression | 0.90     | 0.89     |

    Metrics are indicative; expect variation with dataset size, class balance, preprocessing, and PCA dimension.
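A comparison like the one above can be reproduced in miniature on synthetic data (not the project's email dataset). GaussianNB stands in for the Naive Bayes variant here because the document vectors are dense; scores on toy data will of course differ from the table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the 100-dim document vectors.
X, y = make_classification(n_samples=600, n_features=100, n_informative=20,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, preds):.2f} "
          f"macro-F1={f1_score(y_te, preds, average='macro'):.2f}")
```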

Confusion Matrix of the Classifier ▼

The confusion matrix shows how well the classifier distinguishes between
the three categories: High Priority, Low Priority, and Job Applicant.

It helps to identify where misclassifications occur (e.g., borderline cases between high and low priority).

Confusion Matrix

Deploy on Hugging Face Spaces using Docker

Note: Only Hugging Face Spaces and AWS EC2 worked for me. Free platforms such as Vercel only handle small models; for larger models/vectors you need a paid plan.

  1. Ready-to-use Demo
    Access the live app: Hugging Face Space

Note: The Space may be in sleep mode; allow ~20s for startup.

  2. Dockerized Deployment
    The API is containerized for reproducible deployment.
  • Build: docker build -t email-categorizer .
  • Run: docker run -p 7860:7860 email-categorizer

  3. Automatic Startup
    Hugging Face Spaces handles container orchestration and exposes the Flask API.

  4. Direct Access
    Test endpoints via browser or curl once the Space is active.



API Endpoints

GET /classification

  • Query: email_body (string)
curl --get "URL/classification" \
     --data-urlencode "email_body=Hello HR, I’m applying for the Data Analyst role. CV attached."

Example Response

{"prediction":"job_applicant"}

GET /vectorization

  • Query: text (string)
curl --get "http://localhost:7860/vectorization" \
     --data-urlencode "text=Schedule an urgent meeting for project X today at 3pm."

Example Response

{"vector":[0.0132,-0.0479,0.0211, ... 100 dims ...]}

Document vectors are mean-pooled Word2Vec embeddings.
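For reference, a minimal Flask app exposing both endpoints might look like this. The `vectorize` and `classify` bodies are placeholders standing in for the real joblib-loaded classifier and reduced Word2Vec vectors.

```python
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

def vectorize(text: str) -> np.ndarray:
    """Placeholder: the real app mean-pools the PCA-reduced Word2Vec vectors."""
    return np.zeros(100, dtype=np.float16)

def classify(vec: np.ndarray) -> str:
    """Placeholder: the real app calls the trained Logistic Regression model."""
    return "job_applicant"

@app.route("/classification", methods=["GET"])
def classification():
    email_body = request.args.get("email_body", "")
    return jsonify({"prediction": classify(vectorize(email_body))})

@app.route("/vectorization", methods=["GET"])
def vectorization():
    text = request.args.get("text", "")
    # Cast float16 to plain floats so the vector serializes cleanly to JSON.
    return jsonify({"vector": vectorize(text).astype(float).tolist()})

# To serve locally: app.run(host="0.0.0.0", port=7860)
```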


n8n Flow


Node Hints

  • HTTP Request: GET YOUR_API_URL/classification?email_body={{$json.email_body}}
  • Switch: route on {{$json.prediction}}
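The HTTP Request node above simply issues that GET. A quick way to check the URL it should produce (the base URL below is a placeholder for your deployed endpoint):

```python
from urllib.parse import urlencode

# Hypothetical API base; substitute your own deployed Space URL.
API_URL = "https://your-space.hf.space"

email_body = "Hello HR, I'm applying for the Data Analyst role."
query = urlencode({"email_body": email_body})  # URL-encodes spaces, commas, etc.
url = f"{API_URL}/classification?{query}"
print(url)
```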

License

This project is licensed under the MIT License. You are free to use, modify, and distribute with attribution.

Happy Automating!

⭐ If you found this project useful, please give it a star!


About

Deploying large models (500MB+) via GitHub is complex due to size limits. This project leverages Hugging Face Spaces to overcome these difficulties. Live demo
