### A little background
I asked 8 of my friends to name one thing they would want automated or handled by machine learning. 3 of them said that they receive many unnecessary emails every day and wanted a tool that classifies their emails and only notifies them about the important ones. Based on that, we started our journey.

### About the Dataset
These are actual emails from my HKBU and personal email accounts. There are 4 columns left after cleaning the data. I removed any links that were in the data. Unfortunately, we could not extract the images from the emails.
Based on how much each email matters to us, we classified the emails and assigned each one an importance value.
##### 0.0 -> represents spam.
##### 1.0 -> represents marketing emails from different companies; most of the time we do not want to see them.
##### 2.0 -> emails from our university that read like marketing. These are a little more credible.
##### 3.0 -> internship-related emails, mostly from our university.
##### 4.0 -> emails from our classmates and university professors regarding courses.
##### 5.0 -> You might think that these are the most important ones, which is partially true. These are emails from our banks and student halls, internship offer letters, and similar.
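
The scale above can be kept next to the code as a small lookup table, which makes predictions easier to read. This is a minimal sketch: the names `IMPORTANCE_LABELS`, `describe_importance`, and `should_notify`, as well as the notification threshold of 3.0, are illustrative assumptions and not part of the original notebook.

```python
# Hypothetical lookup for the importance scale described above.
IMPORTANCE_LABELS = {
    0.0: "spam",
    1.0: "company marketing",
    2.0: "university marketing",
    3.0: "internship related",
    4.0: "classmates and professors (courses)",
    5.0: "banks, student halls, offer letters",
}


def describe_importance(value: float) -> str:
    """Return a readable label for an importance value."""
    return IMPORTANCE_LABELS.get(float(value), "unknown")


def should_notify(value: float, threshold: float = 3.0) -> bool:
    """Assumed policy: notify only at or above the threshold."""
    return float(value) >= threshold


print(describe_importance(2.0))
print(should_notify(5.0))
```

Any value outside the scale maps to `"unknown"`, which is a cheap way to spot labelling slips.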



In [60]:
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score, classification_report
from sklearn.base import clone
import warnings
warnings.filterwarnings('ignore')


In [2]:
df = pd.read_excel("Combined_Email.xlsx")
df.head()

Unnamed: 0,From,Subject,Clean Body (No Links),Importance
0,HKBU,Today@HKBU,\nExciting events at HKBU today.\n\nIf you can...,2.0
1,Estates Office,Online Dental Health Talk - Daily Oral Health ...,"Dear Students and Colleagues,\n\n \n\nWe are e...",2.0
2,Language Centre Language Enhancement Programme,[Invitation] The Grand Final of the Three-Minu...,"Dear Students and Colleagues,_x000D_\n_x000D_\...",2.0
3,Career Centre,Join the InnoX Entrepreneurship Summer Camp 20...,< _x000D_\n < _x000D_\n < _x000D_\n_x000D_\nI...,2.0
4,Language Centre Language Enhancement Programme,[Summer Course] Register for SUPG1010 German I...,"Dear students,_x000D_\n _x000D_\nSUPG1010 Germ...",2.0


In [3]:
df.shape

(5500, 4)

In [4]:
df.isna().sum()

From                      0
Subject                   9
Clean Body (No Links)    16
Importance                7
dtype: int64

In [5]:
df['Importance'] = df['Importance'].fillna(1)

In [6]:
df = df.fillna("")

In [7]:
df[df['Importance']>10]

Unnamed: 0,From,Subject,Clean Body (No Links),Importance
883,Estates Office,Student Dental Scheme 2024-2025,< _x000D_\n_x000D_\n _x000D_\n_x000D_\nDear...,22.0
905,"Leadership Qualities Centre, Office of Student...",【CCL X 2 Units】HKBU Skilled Volunteer Cadre 20...,各位同學：_x000D_\n _x000D_\n < _x000D_\n _x000D_\n...,22.0
1053,Career Centre,Recruitment of Administrative Officers in 2024/25,"Dear students,_x000D_\n _x000D_\nThe AO Recrui...",33.0
3412,Glassdoor Jobs <noreply@glassdoor.com>,CRM Analyst (Data Analysis) at A.S. Watson Gro...,Chong Hing Bank is hiring ‌​‍‎‏ ‌​‍‎‏ ‌​‍‎‏ ‌​...,11.0
4699,22Bet <noreply@22bet.com>,🎉 Free bet for new wins,‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ...,11.0
4819,Medium <hello@medium.com>,Support your favorite programming writers for ...,"For a limited time, we’re offering a 20% disco...",11.0
5128,noreply@jijis.org.hk,Job Application for the Post of Intern (JIJIS-...,"Dear Sir/Madam, I am writing to apply for the ...",33.0


In [8]:
df['Importance'] = df['Importance'].replace(22,2)

In [9]:
df['Importance'] = df['Importance'].replace(11,1)
df['Importance'] = df['Importance'].replace(33,3)
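
The three `replace` calls above can also be written as one mapping, followed by a check that no out-of-range labels remain. This is only a sketch on a toy Series. The assumption that 11, 22, and 33 are typing slips for 1, 2, and 3 is taken from the cells above, and `LABEL_FIXES` and `VALID_LABELS` are names introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for df['Importance']; the values are made up for illustration.
importance = pd.Series([2.0, 22.0, 1.0, 11.0, 33.0, 5.0])

# Assumed typo corrections, mirroring the replace() calls above.
LABEL_FIXES = {11: 1, 22: 2, 33: 3}
VALID_LABELS = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0}

fixed = importance.replace(LABEL_FIXES)

# Fail early if any label is still outside the 0-5 scale.
unexpected = set(fixed.unique()) - VALID_LABELS
assert not unexpected, f"Unexpected labels: {unexpected}"
print(fixed.tolist())
```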

In [10]:
df["Clean Body (No Links)"] = df["Clean Body (No Links)"].str.replace(r'_x000D_|\n', ' ', regex=True).str.strip()
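
Because each `_x000D_` marker or newline is replaced by its own space, the cleaned bodies can still contain long runs of spaces. If that matters, whitespace can be collapsed in the same pass. A small sketch with the standard `re` module follows; `clean_body` is a hypothetical helper, not something the notebook defines.

```python
import re


def clean_body(text: str) -> str:
    """Drop _x000D_ carriage-return markers and collapse whitespace runs."""
    text = text.replace("_x000D_", " ")
    return re.sub(r"\s+", " ", text).strip()


sample = "Dear students,_x000D_\n _x000D_\nSUPG1010 German I"
print(clean_body(sample))  # Dear students, SUPG1010 German I
```

With pandas, the same helper could be applied through `df["Clean Body (No Links)"].map(clean_body)`.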


In [23]:
df.head()

Unnamed: 0,From,Subject,Clean Body (No Links),Importance
0,HKBU,Today@HKBU,Exciting events at HKBU today. If you cannot ...,2.0
1,Estates Office,Online Dental Health Talk - Daily Oral Health ...,"Dear Students and Colleagues, We are excit...",2.0
2,Language Centre Language Enhancement Programme,[Invitation] The Grand Final of the Three-Minu...,"Dear Students and Colleagues, You are cordi...",2.0
3,Career Centre,Join the InnoX Entrepreneurship Summer Camp 20...,< < < InnoX Entrepreneurship Summer...,2.0
4,Language Centre Language Enhancement Programme,[Summer Course] Register for SUPG1010 German I...,"Dear students, SUPG1010 German I (Part 1) ...",2.0


In [25]:
df.dtypes

From                      object
Subject                   object
Clean Body (No Links)     object
Importance               float64
dtype: object

In [27]:
# Take a copy so the column assignments below do not write into a view of df
X = df[["From", "Subject", "Clean Body (No Links)"]].copy()
y = df["Importance"]

In [29]:
# Convert all text columns to strings
X["From"] = X["From"].astype(str)
X["Subject"] = X["Subject"].astype(str)
X["Clean Body (No Links)"] = X["Clean Body (No Links)"].astype(str)

In [31]:
# TF-IDF Transformation
preprocessor = ColumnTransformer([
    ('from_tfidf', TfidfVectorizer(), 'From'),
    ('subject_tfidf', TfidfVectorizer(), 'Subject'),
    ('body_tfidf', TfidfVectorizer(), 'Clean Body (No Links)'),
])

In [33]:
# Fit TF-IDF and transform
# Note: this fits on all rows, so vocabulary and IDF weights also see the test emails
X_tfidf = preprocessor.fit_transform(X)

In [34]:
# Split before SMOTE
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42, stratify=y
)

In [35]:
# Apply SMOTE to training data only
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [40]:
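# Stratified 5-fold cross-validation on the already resampled training data
# Note: SMOTE ran before the folds were drawn, so synthetic points can land in a
# validation fold next to their source points; expect optimistic CV scores
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)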

In [42]:
# Models to evaluate
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC(),
    "Gradient Boosting": GradientBoostingClassifier()
}

In [62]:
# Track F1-scores
results = {}

for name, model in models.items():
    print(f"Evaluating {name}...")
    fold_scores = []
    
    for fold, (train_idx, val_idx) in enumerate(cv.split(X_train_resampled, y_train_resampled), 1):
        X_tr, X_val = X_train_resampled[train_idx], X_train_resampled[val_idx]
        y_tr, y_val = y_train_resampled[train_idx], y_train_resampled[val_idx]
        
        clf = clone(model)
        clf.fit(X_tr, y_tr)
        y_pred = clf.predict(X_val)
        
        f1 = f1_score(y_val, y_pred, average='weighted')
        fold_scores.append(f1)
        print(f"  Fold {fold}: F1 = {f1:.4f}")
    
    results[name] = fold_scores
    print(f"Mean F1 for {name}: {np.mean(fold_scores):.4f}\n")

# Convert to DataFrame
f1_df = pd.DataFrame(results)
f1_df.index = [f"Fold {i+1}" for i in range(cv.get_n_splits())]
f1_df.loc["Mean"] = f1_df.mean()
print(" F1-Score Table (by fold):")
print(f1_df)

Evaluating Logistic Regression...
  Fold 1: F1 = 0.9717
  Fold 2: F1 = 0.9695
  Fold 3: F1 = 0.9710
  Fold 4: F1 = 0.9650
  Fold 5: F1 = 0.9628
Mean F1 for Logistic Regression: 0.9680

Evaluating Random Forest...
  Fold 1: F1 = 0.9768
  Fold 2: F1 = 0.9746
  Fold 3: F1 = 0.9694
  Fold 4: F1 = 0.9678
  Fold 5: F1 = 0.9673
Mean F1 for Random Forest: 0.9712

Evaluating SVM...
  Fold 1: F1 = 0.9734
  Fold 2: F1 = 0.9756
  Fold 3: F1 = 0.9722
  Fold 4: F1 = 0.9650
  Fold 5: F1 = 0.9650
Mean F1 for SVM: 0.9703

Evaluating Gradient Boosting...
  Fold 1: F1 = 0.9613
  Fold 2: F1 = 0.9607
  Fold 3: F1 = 0.9646
  Fold 4: F1 = 0.9529
  Fold 5: F1 = 0.9469
Mean F1 for Gradient Boosting: 0.9573

📊 F1-Score Table (by fold):
        Logistic Regression  Random Forest       SVM  Gradient Boosting
Fold 1             0.971732       0.976756  0.973401           0.961275
Fold 2             0.969478       0.974550  0.975594           0.960724
Fold 3             0.971050       0.969426  0.972211           0

In [63]:
# Select best model
best_model_name = f1_df.loc["Mean"].idxmax()
best_model = clone(models[best_model_name])
print(f"\n Best Model: {best_model_name} with Mean F1 = {f1_df.loc['Mean', best_model_name]:.4f}")

# Retrain on full resampled training data
best_model.fit(X_train_resampled, y_train_resampled)

# Final test set evaluation
y_test_pred = best_model.predict(X_test)
print("\n Final Evaluation on Unseen Test Set:")
print(classification_report(y_test, y_test_pred, digits=4))



✅ Best Model: Random Forest with Mean F1 = 0.9712

📈 Final Evaluation on Unseen Test Set:
              precision    recall  f1-score   support

         1.0     0.9569    0.9592    0.9581       417
         2.0     0.9143    0.9447    0.9293       452
         3.0     0.7870    0.7658    0.7763       111
         4.0     0.9615    0.7576    0.8475        33
         5.0     0.7901    0.7356    0.7619        87

    accuracy                         0.9100      1100
   macro avg     0.8820    0.8326    0.8546      1100
weighted avg     0.9092    0.9100    0.9091      1100
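
The two summary rows of the report weight the classes differently, which is why the weaker minority classes (3.0, 4.0, 5.0) pull the macro average well below the weighted one. The arithmetic can be reproduced from the per-class F1 scores and supports printed above:

```python
# Per-class F1 and support copied from the report above.
f1 = {1.0: 0.9581, 2.0: 0.9293, 3.0: 0.7763, 4.0: 0.8475, 5.0: 0.7619}
support = {1.0: 417, 2.0: 452, 3.0: 111, 4.0: 33, 5.0: 87}

total = sum(support.values())  # 1100 test emails

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro F1    = {macro_f1:.4f}")     # 0.8546
print(f"weighted F1 = {weighted_f1:.4f}")  # 0.9091
```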



## Making the model more efficient

In [80]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

grid = GridSearchCV(RandomForestClassifier(), param_grid,
                    cv=3, scoring='f1_weighted', n_jobs=-1, verbose=2)

grid.fit(X_train_resampled, y_train_resampled)

print("Best Parameters:", grid.best_params_)

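# GridSearchCV already refits best_estimator_ on the full resampled training data
# (refit=True is the default), so the explicit fit below is redundant
best_rf = grid.best_estimator_
best_rf.fit(X_train_resampled, y_train_resampled)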

# Final evaluation
y_test_pred = best_rf.predict(X_test)
print("\nFinal Evaluation after tuning:")
print(classification_report(y_test, y_test_pred, digits=4))


Fitting 3 folds for each of 18 candidates, totalling 54 fits
Best Parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 500}

Final Evaluation after tuning:
              precision    recall  f1-score   support

         1.0     0.9592    0.9592    0.9592       417
         2.0     0.9070    0.9491    0.9276       452
         3.0     0.7981    0.7477    0.7721       111
         4.0     0.9630    0.7879    0.8667        33
         5.0     0.8101    0.7356    0.7711        87

    accuracy                         0.9109      1100
   macro avg     0.8875    0.8359    0.8593      1100
weighted avg     0.9098    0.9109    0.9097      1100



In [None]:
## Not as accurate for emails with importance 3.0, 4.0 and 5.0. We could probably make it more accurate with a larger dataset.

In [653]:
# import joblib

# # Save the model and pipeline to a file
# joblib.dump(pipeline, 'email_classifier_svm_model_with_tfidf_smote.pkl')


['email_classifier_svm_model_with_tfidf_smote.pkl']

In [82]:
import joblib

# Save the tuned Random Forest to a file
# Note: only the classifier is saved here; the fitted TF-IDF preprocessor is not included
joblib.dump(best_rf, 'email_classifier_rf_model_with_tfidf_smote.pkl')


['email_classifier_rf_model_with_tfidf_smote.pkl']
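
Note that `best_rf` expects TF-IDF features, so the saved file alone cannot score raw emails; the fitted `preprocessor` is needed as well. One option is to bundle the vectorizer and classifier in a single scikit-learn `Pipeline` and save that. The following is a minimal sketch on made-up data, using a temporary directory and a hypothetical file name:

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up training texts and labels, for illustration only.
texts = ["huge discount sale today", "limited offer buy now",
         "assignment due on friday", "lecture moved to room 5"]
labels = [1.0, 1.0, 4.0, 4.0]

# Vectorizer and classifier travel together in one object.
bundle = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
bundle.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "email_classifier_bundle.pkl")  # hypothetical name
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# The restored bundle accepts raw text directly.
new_emails = ["buy now and save", "friday lecture assignment"]
print(restored.predict(new_emails))
```

For this notebook, the equivalent would be to save the fitted `preprocessor` next to `best_rf`, or to wrap both in one pipeline before calling `joblib.dump`.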