- NLP: spaCy, Gensim (Word2Vec)
- ML: scikit-learn (TF-IDF, PCA, Logistic Regression)
- Serving: Flask
- Automation: n8n
- Utilities: NumPy, Pandas, Joblib
This project is a component of a larger n8n automation workflow, showcasing how to categorize HR-related emails using modern NLP and machine learning techniques.
- Effortlessly manage HR emails with a fast, production-ready pipeline.
- Automatic categorization into `high_priority`, `low_priority`, or `job_applicant`.
- Smarter text analysis using TF-IDF + Word2Vec.
- Rapid, memory-efficient predictions with PCA and float16 quantization.
- Easy integration via a scalable Flask API and n8n workflows for automated routing.
- **Documentation (French):** See Chapter 5 of the report for a detailed explanation of the project, including methodology, results, and technical choices. (Content in French.)
- **NLP Preprocessing (spaCy):** Tokenization, lowercasing, stop-word filtering, and optional lemmatization for clean input.
- **Hybrid Vectorization:**
  - TF-IDF baseline for lexical signals.
  - Word2Vec for semantic signals, using mean pooling for robust document vectors.
- **Modeling Choices:** Compared Logistic Regression (selected), Naive Bayes, and KNN for best results.
- **Dimensionality Reduction:** PCA compresses vectors from 300 → 100 dimensions for fast inference and compact storage.
- **Quantization:** Vectors stored as float16 to minimize memory usage.
- **REST API Endpoints:** `/classification` and `/vectorization` (GET) for easy integration.
- **Workflow Automation (n8n):** Gmail Trigger → HTTP Request (API) → Switch & routing for seamless email management.
- **Deployment Ready:** Supports Hugging Face Spaces, Vercel, and traditional servers for flexible hosting.
```text
Text-Categorization-TF-IDF-Word2Vec-Embeddings/
├─ Flask-API/         # Flask server (API endpoints, loading vectors/model)
├─ NLP_TF-IDF/        # TF-IDF experiments/baselines
├─ NLP_word2vec/      # Word2Vec + PCA reduction utilities and notebooks
├─ data/              # dataset
├─ requirements.txt
├─ vercel.json        # (optional) hosting config
└─ README.md
```
```bash
git clone https://github.com/moghit-eou/Text-Categorization-TF-IDF-Word2Vec-Embeddings.git
cd Text-Categorization-TF-IDF-Word2Vec-Embeddings

python3.10 -m venv .venv    # Windows: py -3.10 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

pip install -r requirements.txt
```
- **Download GoogleNews Word2Vec (300-dim, ~3.5 GB):** Get the original `GoogleNews-vectors-negative300.bin` from Hugging Face.
- **Reduce Dimensionality & Quantize:**
  - Apply PCA to compress vectors from 300 → 100 dimensions.
  - Save as float16 keyed vectors for efficient storage.
**Why PCA & Quantization?**
- PCA shrinks the original 300-dim Word2Vec vectors to 100 dimensions, reducing size from ~3.5 GB to just ~600 MB, speeding up inference, and making deployment practical for cloud/serverless hosting.
- Quantization (float16) further cuts the memory footprint.
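The two steps above can be sketched as follows. Random vectors stand in for the real ~3M × 300 GoogleNews matrix (loading the actual `.bin` requires gensim and ~3.5 GB of RAM); the memory arithmetic is the same.

```python
# Sketch of PCA reduction (300 -> 100 dims) plus float16 quantization.
# A random 10k x 300 matrix stands in for the GoogleNews Word2Vec weights.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
vectors_300d = rng.normal(size=(10_000, 300)).astype(np.float32)

pca = PCA(n_components=100)
vectors_100d = pca.fit_transform(vectors_300d)

# float16 halves storage again on top of the 3x from dropping dimensions,
# giving a 6x overall reduction (matching ~3.5 GB -> ~600 MB).
vectors_100d_f16 = vectors_100d.astype(np.float16)

print(vectors_300d.nbytes // 1024, "KiB ->", vectors_100d_f16.nbytes // 1024, "KiB")
```

In the real pipeline the reduced matrix would be written back as gensim keyed vectors so word lookup still works after compression.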
Visualization:
The image above shows a 2D projection of document vectors for the three categories before and after PCA reduction (300D β 100D).
As seen, the separation between classes remains clear, indicating that dimensionality reduction does not significantly impact model accuracy.
- **Transform Text & Classify**
  - **Vectorize Email:** Convert the email body into a vector using mean-pooled Word2Vec embeddings.

**Making sure mean pooling is applied correctly**
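Mean pooling simply averages the word vectors of the in-vocabulary tokens; a tiny dictionary stands in here for the real keyed vectors, and out-of-vocabulary words are skipped.

```python
# Sketch of mean pooling: the document vector is the average of the word
# vectors for in-vocabulary tokens. A toy 2-dim vocabulary replaces the
# real 100-dim PCA-reduced Word2Vec keyed vectors.
import numpy as np

toy_vectors = {
    "urgent":  np.array([1.0, 0.0], dtype=np.float32),
    "meeting": np.array([0.0, 1.0], dtype=np.float32),
}

def mean_pool(tokens, vectors, dim=2):
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:  # no in-vocabulary token: fall back to a zero vector
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

doc_vec = mean_pool(["urgent", "meeting", "unknownword"], toy_vectors)
print(doc_vec)  # [0.5 0.5]
```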
- **Classify with ML Model:** Train a Logistic Regression classifier to predict the category (`high_priority`, `low_priority`, or `job_applicant`).

| Model | Accuracy | Macro-F1 |
|---|---|---|
| Naive Bayes | 0.83 | 0.82 |
| KNN | 0.61 | 0.85 |
| Logistic Regression | 0.90 | 0.89 |

Metrics are indicative; expect variation with dataset size, class balance, preprocessing, and PCA dimension.
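The three-way model comparison can be reproduced in outline with scikit-learn. This sketch uses synthetic 100-dim, 3-class data in place of the project's email vectors, so the scores it prints will not match the table above.

```python
# Sketch of the classifier comparison on synthetic 100-dim, 3-class data
# (stand-in for the PCA-reduced email vectors and their labels).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=100, n_informative=20,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("KNN", KNeighborsClassifier()),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(f"{name:20s} acc={accuracy_score(y_te, y_pred):.2f} "
          f"macro-F1={f1_score(y_te, y_pred, average='macro'):.2f}")
```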
**Confusion Matrix of the Classifier**
Note: Only Hugging Face Spaces and AWS EC2 worked for me. Free platforms like Vercel and similar only work for small models; for larger models/vectors, you need a paid plan.
- Ready-to-use Demo
Access the live app: Hugging Face Space
Note: The Space may be in sleep mode; allow ~20s for startup.
- Dockerized Deployment
The API is containerized for reproducible deployment.
- Build:

  ```bash
  docker build -t email-categorizer .
  ```

- Run:

  ```bash
  docker run -p 7860:7860 email-categorizer
  ```
- **Automatic Startup:** Hugging Face Spaces handles container orchestration and exposes the Flask API.
- **Direct Access:** Test endpoints via browser or `curl` once the Space is active.
- Query parameter: `email_body` (string)

  ```bash
  curl --get "URL/classification" \
    --data-urlencode "email_body=Hello HR, I'm applying for the Data Analyst role. CV attached."
  ```

  Example response:

  ```json
  {"prediction": "job_applicant"}
  ```

- Query parameter: `text` (string)

  ```bash
  curl --get "http://localhost:7860/vectorization" \
    --data-urlencode "text=Schedule an urgent meeting for project X today at 3pm."
  ```

  Example response:

  ```json
  {"vector": [0.0132, -0.0479, 0.0211, ... 100 dims ...]}
  ```

  Document vectors are mean-pooled Word2Vec embeddings.
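The same GET requests can be built from Python instead of `curl`. The base URL below is a placeholder for your deployed endpoint; only the query encoding is shown, the actual call is left as a comment.

```python
# Build the /classification GET request programmatically, mirroring the
# curl --data-urlencode example above. "http://localhost:7860" is a
# placeholder base URL for your own deployment.
from urllib.parse import urlencode

base = "http://localhost:7860"
query = urlencode({"email_body": "Hello HR, I'm applying for the Data Analyst role."})
url = f"{base}/classification?{query}"
print(url)
# A live call would then be e.g.: requests.get(url).json()
```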
Node Hints:
- HTTP Request: `GET YOUR_API_URL/classification?email_body={{$json.email_body}}`
- Switch: route on `{{$json.prediction}}`
This project is licensed under the MIT License. You are free to use, modify, and distribute with attribution.







