# Prioritize Email Model

## Goal
This notebook demonstrates a basic model that sorts emails into three priority buckets: `Slow`, `Default`, and `Prioritize`.

## Outcome
- Built a logistic regression model to predict email priority.
- The model performs very well on synthetic data (trained on n=500), with near-perfect accuracy. This suggests the synthetic data may be too cookie-cutter to reflect real-world variation.
- The model was applied to two real-world unlabeled email datasets, the Resend spam example and the Enron dataset, with results shown below.

## Methodology

### Datasets
Three datasets were used:
- Synthetic labeled data
  - Used to train the model and evaluate its performance.
  - Generated using the scripts in the utils folder to cover several use cases: promotional emails, MFA verification, time-sensitive emails, and non-urgent emails.
- Unlabeled spam detection data
  - Provided by Resend.
- Unlabeled Enron email data
  - More details [here](https://technocrat.github.io/_book/the-enron-email-corpus.html)

## Model training
The model was trained on the synthetic data. Although the dataset contains 10k emails, only 500 were used for training (1,000 are held out and the remaining 8,500 form the test split), both to avoid overtraining and because the synthetic data is fairly simple. Even with this limited training data, accuracy exceeded 99%, which suggests the synthetic classes are easy to separate and that this performance is unlikely to generalize to real-world email. Even so, it gives a good baseline to improve on as more real data comes in.

## Model evaluation
Three models were considered: logistic regression, random forest, and CatBoost. Given the simple nature of the training data, logistic regression performed well enough and was chosen for its simplicity. We can revisit switching to a more complex model as we get more features and labeled real-world data.




In [1]:

import sys
import os 
import pandas as pd

module_path = os.path.abspath(os.path.join(os.getcwd(), "../email_prioritizer"))
sys.path.append(module_path)

from email_prioritizer import EmailPrioritizer
from sklearn.model_selection import train_test_split

In [2]:
list_data_files = [
    '../data/mfa_verification_emails.csv',
    '../data/promotional_emails.csv',
    '../data/urgent_time_sensitive_emails.csv',
    '../data/non_urgent_basic_emails.csv'
]

In [3]:

df_emails_data = pd.concat([pd.read_csv(file) for file in list_data_files], ignore_index=True)
df_emails_data.shape

(10000, 2)

In [4]:
num_holdout = 1000
df_emails_train, df_emails_holdout = train_test_split(df_emails_data, test_size=(num_holdout*1./df_emails_data.shape[0]), random_state=42)
df_emails_train

Unnamed: 0,email,label
4896,From: billing@billing.example.com\nSubject: UR...,Prioritize
4782,From: operations@security.company.com\nSubject...,Prioritize
1496,From: offers@doordash.com\nSubject: Member exc...,Slow
1957,From: sales@target.com\nSubject: Your deal is ...,Slow
9171,From: alex@company.com\nSubject: Following up ...,Default
...,...,...
5734,From: sam@project.io\nSubject: Small update on...,Default
5191,From: avery@project.io\nSubject: Quick questio...,Default
5390,From: morgan@project.io\nSubject: Thanks for y...,Default
860,From: alert@wellsfargo.com\nSubject: Identity ...,Prioritize


In [5]:
# df_emails_holdout.to_csv('../data/holdout_emails.csv', index=False)

In [6]:
x_col = 'email'
y_col = 'label'
X = df_emails_train[x_col]
y = df_emails_train[y_col]
X_holdout = df_emails_holdout[x_col]
y_holdout = df_emails_holdout[y_col]
# Keep about 500 of the 9,000 non-holdout emails for training; the rest become the test split.
test_size = .9444
# test_size = .2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
print(f"Training size: {X_train.shape[0]}, Test size: {X_test.shape[0]}")

Training size: 500, Test size: 8500
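
The 500/8,500 split follows from how `train_test_split` turns a fractional `test_size` into a row count: the test count is rounded up, and the training set gets whatever is left. The arithmetic can be checked with a small sketch; `split_sizes` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Mirror the rounding train_test_split applies to a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)   # test count is rounded up
    n_train = n_samples - n_test                # training set gets the remainder
    return n_train, n_test

# First split: 1,000 of the 10,000 emails are held out.
n_rest, n_holdout = split_sizes(10_000, 1000 / 10_000)
# Second split: the remaining 9,000 emails are divided with test_size=.9444.
n_train, n_test = split_sizes(n_rest, 0.9444)
print(n_rest, n_holdout, n_train, n_test)
```

This gives 9,000 remaining emails, 1,000 holdout emails, 500 training emails, and 8,500 test emails. Passing an integer such as `train_size=500` to `train_test_split` would state the same intent without the fractional arithmetic.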


In [7]:

import re
pattern = r"From:\s*(?P<From>.+)\nSubject:\s*(?P<Subject>.+)\n(?P<Body>.+)"
df_email_holdout = X_holdout.str.extract(pattern, flags=re.DOTALL)
df_email_holdout


Unnamed: 0,From,Subject,Body
6252,alex@mail.com,Quick question about marketing materials,"Hey Jess, Here’s a quick summary of where we l..."
4684,operations@ops.company.com,URGENT: Payment Overdue: Invoice INV-907918,"Hello Casey Garcia, This is a final reminder: ..."
1731,no-reply@airbnb.com,Early access: 30% off new arrivals!,Take 30% off and enjoy free shipping on select...
4742,finance@alerts.bank.com,URGENT: Security Alert: Respond by 2025-10-24 ...,"Hello Josh Clark, Reference: case #7861466. Co..."
4521,operations@billing.example.com,Escalation: server outage ticket needs your ap...,We need a short confirmation (Yes/No) within 6...
...,...,...,...
3921,newsletter@costco.com,Exclusive online offer: 25% off everything!,Check out our latest deals — 25% off selected ...
6685,alex@teamhub.io,Quick question about the Q3 project,"Hello Jamie, Everything seems to be on track a..."
3194,no-reply@airbnb.com,Spring sale is here! 30% off select items!,Your personalized deal: 30% off selected items...
1941,promotions@nike.com,Your deal is here: 20% off selected items!,Spring into savings with 20% off select products.
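
Because `.+` is greedy and `re.DOTALL` lets it cross newlines, the `Subject` group in the pattern above extends to the last newline in the message, so it only isolates the subject when the body is a single line. A hedged alternative, assuming each header sits on its own line and the body is everything after it, restricts the header groups to one line. `EMAIL_PATTERN` and `parse_email` are illustrative names, not part of `email_prioritizer`.

```python
import re

EMAIL_PATTERN = re.compile(
    r"From:[ \t]*(?P<From>[^\n]*)\n"         # sender: rest of the From line
    r"Subject:[ \t]*(?P<Subject>[^\n]*)\n+"  # subject: rest of the Subject line
    r"(?P<Body>.*)",                         # body: everything that follows
    flags=re.DOTALL,
)

def parse_email(raw):
    """Return the From, Subject and Body fields of one raw email, or None."""
    match = EMAIL_PATTERN.search(raw)
    return match.groupdict() if match else None

# Made-up message with a blank line after the headers and a two-line body.
sample = "From: alex@company.com\nSubject: Following up\n\nHi Jess,\nNotes attached."
print(parse_email(sample))
```

The same pattern string (`EMAIL_PATTERN.pattern`) can be passed to `Series.str.extract` together with `flags=re.DOTALL`.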


In [8]:
detector = EmailPrioritizer(model_type="logistic")
detector.fit(X_train, y_train, cv=3)
detector.evaluate(X_test, y_test)
df_results = pd.DataFrame({
    'email_from': df_email_holdout.From,
    'email_subject': df_email_holdout.Subject,
    'email_body': df_email_holdout.Body,
    'prediction': detector.predict(X_holdout),
    'true_label': y_holdout,
    'correct': detector.predict(X_holdout) == y_holdout,
    'score_default': detector.predict_proba(X_holdout)[:,0],
    'score_prioritize': detector.predict_proba(X_holdout)[:,1],
    'score_slow': detector.predict_proba(X_holdout)[:,2],
})
df_results

Running GridSearchCV for logistic...
Fitting 3 folds for each of 4 candidates, totalling 12 fits





Best cross-val f1 score: 1.000
Best parameters: {'model__C': 0.1}

Evaluation Report:
              precision    recall  f1-score   support

     Default       1.00      1.00      1.00      4244
  Prioritize       1.00      1.00      1.00      1674
        Slow       1.00      1.00      1.00      2582

    accuracy                           1.00      8500
   macro avg       1.00      1.00      1.00      8500
weighted avg       1.00      1.00      1.00      8500

Accuracy: 1.0
F1 Score: 1.0


Unnamed: 0,email_from,email_subject,email_body,prediction,true_label,correct,score_default,score_prioritize,score_slow
6252,alex@mail.com,Quick question about marketing materials,"Hey Jess, Here’s a quick summary of where we l...",Default,Default,True,0.833174,0.072915,0.093912
4684,operations@ops.company.com,URGENT: Payment Overdue: Invoice INV-907918,"Hello Casey Garcia, This is a final reminder: ...",Prioritize,Prioritize,True,0.210559,0.628302,0.161139
1731,no-reply@airbnb.com,Early access: 30% off new arrivals!,Take 30% off and enjoy free shipping on select...,Slow,Slow,True,0.177824,0.223684,0.598492
4742,finance@alerts.bank.com,URGENT: Security Alert: Respond by 2025-10-24 ...,"Hello Josh Clark, Reference: case #7861466. Co...",Prioritize,Prioritize,True,0.169459,0.666958,0.163583
4521,operations@billing.example.com,Escalation: server outage ticket needs your ap...,We need a short confirmation (Yes/No) within 6...,Prioritize,Prioritize,True,0.192651,0.646369,0.160980
...,...,...,...,...,...,...,...,...,...
3921,newsletter@costco.com,Exclusive online offer: 25% off everything!,Check out our latest deals — 25% off selected ...,Slow,Slow,True,0.190306,0.128548,0.681146
6685,alex@teamhub.io,Quick question about the Q3 project,"Hello Jamie, Everything seems to be on track a...",Default,Default,True,0.845613,0.067461,0.086926
3194,no-reply@airbnb.com,Spring sale is here! 30% off select items!,Your personalized deal: 30% off selected items...,Slow,Slow,True,0.128368,0.151655,0.719977
1941,promotions@nike.com,Your deal is here: 20% off selected items!,Spring into savings with 20% off select products.,Slow,Slow,True,0.144881,0.109942,0.745177
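
The `score_*` columns rely on the column order of `predict_proba`. scikit-learn classifiers sort their class labels, so the order here is `Default`, `Prioritize`, `Slow`, which matches the hard-coded indices. The sketch below derives the column names from the class list instead. It assumes the fitted pipeline exposes scikit-learn's `classes_` attribute (for example via `detector.pipeline.classes_`, which this notebook does not confirm); `score_columns` is a hypothetical helper.

```python
def score_columns(classes, proba_rows):
    """Map each class label to its column of predicted probabilities."""
    return {
        f"score_{label.lower()}": [row[i] for row in proba_rows]
        for i, label in enumerate(classes)
    }

# Class order as scikit-learn would sort the three labels, with two rows of
# probabilities copied from the logistic regression results table.
classes = ["Default", "Prioritize", "Slow"]
proba = [
    [0.833174, 0.072915, 0.093912],
    [0.210559, 0.628302, 0.161139],
]
columns = score_columns(classes, proba)
print(columns["score_prioritize"])
```

The resulting dictionary can be unpacked into the `pd.DataFrame` constructor alongside the other columns, so a change in label names or order does not silently mislabel the scores.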


In [9]:

from pathlib import Path
sys.path.append(str(Path().parent))

In [10]:

import dill

save_model = False
model = detector.pipeline

if save_model:
    os.chdir('../')
    filename = './finalized_model_lr.dill'
    print(os.getcwd())
    with open(filename, 'wb') as file:
        dill.dump(model, file, recurse=True)
    
    with open("./finalized_model_lr.dill", "rb") as f:
        model = dill.load(f)
        print(model)
    os.chdir('./notebooks')  # change back to notebooks directory
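
Changing the working directory to write the model file can leave the notebook in the wrong directory if an exception occurs part-way through. One alternative is to build the path explicitly. The sketch below uses the standard-library `pickle` module as a stand-in so that it runs without extra dependencies; `dill.dump` and `dill.load` follow the same call shape. `save_and_reload` is a hypothetical helper, and the demo writes a plain dictionary to a temporary directory rather than the real pipeline.

```python
import pickle
import tempfile
from pathlib import Path

def save_and_reload(obj, path, serializer=pickle):
    """Write obj to path, then read it back to confirm the round trip."""
    path = Path(path)
    with path.open("wb") as file:
        serializer.dump(obj, file)
    with path.open("rb") as file:
        return serializer.load(file)

# Stand-in object; in the notebook this would be detector.pipeline.
with tempfile.TemporaryDirectory() as tmp_dir:
    restored = save_and_reload({"model": "logistic"}, Path(tmp_dir) / "model.pkl")
print(restored)
```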

In [11]:
detector_rf = EmailPrioritizer(model_type="random_forest")
detector_rf.fit(X_train, y_train, cv=3)
detector_rf.evaluate(X_test, y_test)
df_results_rf = pd.DataFrame({
    'email_from': df_email_holdout.From,
    'email_subject': df_email_holdout.Subject,
    'email_body': df_email_holdout.Body,
    'prediction': detector_rf.predict(X_holdout),
    'true_label': y_holdout,
    'correct': detector_rf.predict(X_holdout) == y_holdout,
    'score_default': detector_rf.predict_proba(X_holdout)[:,0],
    'score_prioritize': detector_rf.predict_proba(X_holdout)[:,1],
    'score_slow': detector_rf.predict_proba(X_holdout)[:,2],
})
df_results_rf

Running GridSearchCV for random_forest...
Fitting 3 folds for each of 12 candidates, totalling 36 fits

Best cross-val f1 score: 1.000
Best parameters: {'model__max_depth': 20, 'model__min_samples_split': 2, 'model__n_estimators': 100}

Evaluation Report:
              precision    recall  f1-score   support

     Default       1.00      1.00      1.00      4244
  Prioritize       1.00      0.99      0.99      1674
        Slow       0.99      1.00      1.00      2582

    accuracy                           1.00      8500
   macro avg       1.00      1.00      1.00      8500
weighted avg       1.00      1.00      1.00      8500

Accuracy: 0.9976470588235294
F1 Score: 0.9976433322573789


Unnamed: 0,email_from,email_subject,email_body,prediction,true_label,correct,score_default,score_prioritize,score_slow
6252,alex@mail.com,Quick question about marketing materials,"Hey Jess, Here’s a quick summary of where we l...",Default,Default,True,0.935725,0.021530,0.042745
4684,operations@ops.company.com,URGENT: Payment Overdue: Invoice INV-907918,"Hello Casey Garcia, This is a final reminder: ...",Prioritize,Prioritize,True,0.031164,0.922467,0.046369
1731,no-reply@airbnb.com,Early access: 30% off new arrivals!,Take 30% off and enjoy free shipping on select...,Slow,Slow,True,0.221707,0.024827,0.753466
4742,finance@alerts.bank.com,URGENT: Security Alert: Respond by 2025-10-24 ...,"Hello Josh Clark, Reference: case #7861466. Co...",Prioritize,Prioritize,True,0.021422,0.904334,0.074244
4521,operations@billing.example.com,Escalation: server outage ticket needs your ap...,We need a short confirmation (Yes/No) within 6...,Prioritize,Prioritize,True,0.001604,0.961594,0.036803
...,...,...,...,...,...,...,...,...,...
3921,newsletter@costco.com,Exclusive online offer: 25% off everything!,Check out our latest deals — 25% off selected ...,Slow,Slow,True,0.029971,0.023267,0.946762
6685,alex@teamhub.io,Quick question about the Q3 project,"Hello Jamie, Everything seems to be on track a...",Default,Default,True,0.967191,0.012793,0.020016
3194,no-reply@airbnb.com,Spring sale is here! 30% off select items!,Your personalized deal: 30% off selected items...,Slow,Slow,True,0.020584,0.026783,0.952634
1941,promotions@nike.com,Your deal is here: 20% off selected items!,Spring into savings with 20% off select products.,Slow,Slow,True,0.014962,0.019961,0.965076


In [12]:
detector_cb = EmailPrioritizer(model_type="catboost")
detector_cb.fit(X_train, y_train, cv=3)
detector_cb.evaluate(X_test, y_test)
df_results_cb = pd.DataFrame({
    'email_from': df_email_holdout.From,
    'email_subject': df_email_holdout.Subject,
    'email_body': df_email_holdout.Body,
    'prediction': detector_cb.predict(X_holdout),
    'true_label': y_holdout,
    'correct': detector_cb.predict(X_holdout) == y_holdout,
    'score_default': detector_cb.predict_proba(X_holdout)[:,0],
    'score_prioritize': detector_cb.predict_proba(X_holdout)[:,1],
    'score_slow': detector_cb.predict_proba(X_holdout)[:,2],
})
df_results_cb

Running GridSearchCV for catboost...
Fitting 3 folds for each of 8 candidates, totalling 24 fits

Best cross-val f1 score: 0.986
Best parameters: {'model__depth': 6, 'model__iterations': 200, 'model__learning_rate': 0.1}

Evaluation Report:
              precision    recall  f1-score   support

     Default       0.99      1.00      1.00      4244
  Prioritize       1.00      0.97      0.99      1674
        Slow       0.99      1.00      1.00      2582

    accuracy                           0.99      8500
   macro avg       1.00      0.99      0.99      8500
weighted avg       0.99      0.99      0.99      8500

Accuracy: 0.9947058823529412
F1 Score: 0.9946788090095878


Unnamed: 0,email_from,email_subject,email_body,prediction,true_label,correct,score_default,score_prioritize,score_slow
6252,alex@mail.com,Quick question about marketing materials,"Hey Jess, Here’s a quick summary of where we l...",Default,Default,True,0.996128,0.000630,0.003242
4684,operations@ops.company.com,URGENT: Payment Overdue: Invoice INV-907918,"Hello Casey Garcia, This is a final reminder: ...",Prioritize,Prioritize,True,0.007121,0.979690,0.013190
1731,no-reply@airbnb.com,Early access: 30% off new arrivals!,Take 30% off and enjoy free shipping on select...,Slow,Slow,True,0.005984,0.003856,0.990160
4742,finance@alerts.bank.com,URGENT: Security Alert: Respond by 2025-10-24 ...,"Hello Josh Clark, Reference: case #7861466. Co...",Prioritize,Prioritize,True,0.022266,0.947107,0.030627
4521,operations@billing.example.com,Escalation: server outage ticket needs your ap...,We need a short confirmation (Yes/No) within 6...,Prioritize,Prioritize,True,0.004094,0.989346,0.006560
...,...,...,...,...,...,...,...,...,...
3921,newsletter@costco.com,Exclusive online offer: 25% off everything!,Check out our latest deals — 25% off selected ...,Slow,Slow,True,0.004937,0.003441,0.991622
6685,alex@teamhub.io,Quick question about the Q3 project,"Hello Jamie, Everything seems to be on track a...",Default,Default,True,0.995689,0.000919,0.003392
3194,no-reply@airbnb.com,Spring sale is here! 30% off select items!,Your personalized deal: 30% off selected items...,Slow,Slow,True,0.004357,0.003142,0.992501
1941,promotions@nike.com,Your deal is here: 20% off selected items!,Spring into savings with 20% off select products.,Slow,Slow,True,0.007147,0.004335,0.988518


Logistic regression is used going forward, since it is both the simplest model and the best performing on the test split.
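
The reported accuracies can be turned into misclassification counts on the 8,500-email test split, which makes the gap between the models easier to read. The accuracy values below are copied from the evaluation output above; `errors_from_accuracy` is an illustrative helper.

```python
def errors_from_accuracy(accuracy, n_samples):
    """Number of misclassified samples implied by an accuracy score."""
    return round((1 - accuracy) * n_samples)

# Accuracy values as printed by the three evaluation reports.
reported_accuracy = {
    "logistic": 1.0,
    "random_forest": 0.9976470588235294,
    "catboost": 0.9947058823529412,
}
n_test = 8500
for name, accuracy in reported_accuracy.items():
    print(name, errors_from_accuracy(accuracy, n_test))
```

This works out to 0, 20, and 45 misclassified emails for logistic regression, random forest, and CatBoost respectively.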

In [13]:
# Resend dataset (unlabeled)


In [14]:
df_emails_spam = pd.read_csv('../data/email_classification_dataset_resend.csv').sample(n=500, random_state=42)
pattern = r"From:\s*(?P<From>.+)\nSubject:\s*(?P<Subject>.+)\n(?P<Body>.+)"
df_emails_spam_parsed = df_emails_spam.email.str.extract(pattern, flags=re.DOTALL)
df_emails_spam_parsed

Unnamed: 0,From,Subject,Body
6252,friend@personalmail.net,Catching Up - How are you?\n,Thank you for your order #6789. Your items wil...
4684,noreply@softwareupdates.com,Meeting Reminder: Project Alpha\n,We value your feedback! Please take a few mome...
1731,friend@personalmail.net,Photos from the Weekend Trip\n,"Hi everyone, I've uploaded the photos from our..."
4742,deals@best-offers.xyz,Verify Your Bank Details Immediately\n,Invest in our revolutionary new platform and e...
4521,team@projectmanagement.com,Meeting Reminder: Project Alpha\n,Thank you for your order #6789. Your items wil...
...,...,...,...
5170,survey@retailfeedback.com,Weekly Newsletter - Latest Updates\n,This is an automated notification regarding an...
7205,survey@retailfeedback.com,Photos from the Weekend Trip\n,"Hey [Friend's Name], it's been a while! How ha..."
2522,info@customerservice.co,Team Stand-up at 10 AM\n,"Good morning, everyone. Just a quick reminder ..."
2215,friend@personalmail.net,Meeting Reminder: Project Alpha\n,Thank you for reaching out regarding [your inq...


In [15]:

prediction_proba = detector.predict_proba(df_emails_spam['email'])
df_results_resend = pd.DataFrame({
    'email_sender': df_emails_spam_parsed['From'],
    'email_subject': df_emails_spam_parsed['Subject'],
    'email_body': df_emails_spam_parsed['Body'],
    'prediction': detector.predict(df_emails_spam['email']),
    'score_default': prediction_proba[:,0],
    'score_prioritize': prediction_proba[:,1],
    'score_slow': prediction_proba[:,2],
})
df_results_resend

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
6252,friend@personalmail.net,Catching Up - How are you?\n,Thank you for your order #6789. Your items wil...,Slow,0.368353,0.248976,0.382671
4684,noreply@softwareupdates.com,Meeting Reminder: Project Alpha\n,We value your feedback! Please take a few mome...,Default,0.494233,0.256467,0.249300
1731,friend@personalmail.net,Photos from the Weekend Trip\n,"Hi everyone, I've uploaded the photos from our...",Default,0.439081,0.213371,0.347548
4742,deals@best-offers.xyz,Verify Your Bank Details Immediately\n,Invest in our revolutionary new platform and e...,Slow,0.238963,0.251451,0.509586
4521,team@projectmanagement.com,Meeting Reminder: Project Alpha\n,Thank you for your order #6789. Your items wil...,Default,0.447112,0.241172,0.311715
...,...,...,...,...,...,...,...
5170,survey@retailfeedback.com,Weekly Newsletter - Latest Updates\n,This is an automated notification regarding an...,Default,0.389434,0.275267,0.335298
7205,survey@retailfeedback.com,Photos from the Weekend Trip\n,"Hey [Friend's Name], it's been a while! How ha...",Slow,0.311532,0.227148,0.461320
2522,info@customerservice.co,Team Stand-up at 10 AM\n,"Good morning, everyone. Just a quick reminder ...",Default,0.473288,0.209986,0.316726
2215,friend@personalmail.net,Meeting Reminder: Project Alpha\n,Thank you for reaching out regarding [your inq...,Default,0.517082,0.228901,0.254017
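
On this unlabeled data the winning class often leads by a small margin (the first row scores 0.38 for `Slow` against 0.37 for `Default`), so the predicted label alone hides how uncertain the model is. One way to surface that, sketched here with a hypothetical `top_two_margin` helper, is to compute the gap between the two highest class scores.

```python
def top_two_margin(scores):
    """Gap between the highest and second-highest class probability."""
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[1]

# (score_default, score_prioritize, score_slow) for the first two rows above.
rows = {
    6252: (0.368353, 0.248976, 0.382671),
    4684: (0.494233, 0.256467, 0.249300),
}
for index, scores in rows.items():
    print(index, round(top_two_margin(scores), 6))
```

Rows with a small margin are natural candidates for manual review or labeling when building a real-world training set.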


In [16]:
df_results_resend.sort_values('score_prioritize', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
2018,admin@bank-verify.org,Important Security Alert: Login from New Device\n,We detected suspicious login attempts on your ...,Prioritize,0.237406,0.597751,0.164843
5703,support@secure-login.com,Important Security Alert: Login from New Device\n,Invest in our revolutionary new platform and e...,Prioritize,0.203029,0.541321,0.25565
970,support@secure-login.com,URGENT: Your Account Has Been Compromised!\n,Urgent security notification! A login from an ...,Prioritize,0.240753,0.527467,0.231781
7601,invest@global-finance.biz,URGENT: Your Account Has Been Compromised!\n,"Dear customer, your account has been temporari...",Prioritize,0.245252,0.524159,0.230589
9411,support@legitcompany.com,Your Order #6789 Confirmed\n,Please find attached your invoice for the serv...,Prioritize,0.203907,0.523515,0.272578
2124,security@alert-system.ru,Important Security Alert: Login from New Device\n,Invest in our revolutionary new platform and e...,Prioritize,0.225034,0.518841,0.256124
6552,support@secure-login.com,Congratulations! You've Won a Free iPhone!\n,We detected suspicious login attempts on your ...,Prioritize,0.252862,0.504065,0.243073
4367,support@legitcompany.com,Invoice for Services Rendered\n,Please find attached your invoice for the serv...,Prioritize,0.246352,0.496803,0.256846
8864,support@secure-login.com,Your Package Is Delayed - Action Required\n,Urgent security notification! A login from an ...,Prioritize,0.260281,0.485868,0.253851
35,admin@bank-verify.org,Important Security Alert: Login from New Device\n,Invest in our revolutionary new platform and e...,Prioritize,0.285619,0.480355,0.234025


In [17]:
df_results_resend.sort_values('score_default', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
6039,john.doe@example.com,Important: Software Update Notification\n,"Hey [Friend's Name], it's been a while! How ha...",Default,0.671815,0.146626,0.181559
2922,john.doe@example.com,Important: Software Update Notification\n,"Hey [Friend's Name], it's been a while! How ha...",Default,0.671815,0.146626,0.181559
483,john.doe@example.com,Meeting Reminder: Project Alpha\n,"Good morning, everyone. Just a quick reminder ...",Default,0.669228,0.161376,0.169396
9445,john.doe@example.com,Meeting Reminder: Project Alpha\n,We value your feedback! Please take a few mome...,Default,0.650406,0.167662,0.181932
8080,john.doe@example.com,Weekly Newsletter - Latest Updates\n,"Good morning, everyone. Just a quick reminder ...",Default,0.647315,0.171648,0.181037
6149,john.doe@example.com,Meeting Reminder: Project Alpha\n,Here's your weekly dose of news and updates fr...,Default,0.643381,0.166273,0.190346
5309,john.doe@example.com,Important: Software Update Notification\n,"Hi team, just a reminder about our Project Alp...",Default,0.641976,0.171428,0.186596
735,john.doe@example.com,Weekly Newsletter - Latest Updates\n,"Hi team, just a reminder about our Project Alp...",Default,0.633466,0.175614,0.19092
4630,john.doe@example.com,Weekly Newsletter - Latest Updates\n,Here's your weekly dose of news and updates fr...,Default,0.620701,0.176399,0.2029
7618,john.doe@example.com,Weekly Newsletter - Latest Updates\n,Here's your weekly dose of news and updates fr...,Default,0.620701,0.176399,0.2029


In [18]:
df_results_resend.sort_values('score_slow', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
487,deals@best-offers.xyz,Unclaimed Funds Await You\n,Don't miss out on this incredible offer! All p...,Slow,0.205877,0.154008,0.640115
582,deals@best-offers.xyz,Exclusive Investment Opportunity - High Return...,Boost your social media presence! Get thousand...,Slow,0.208142,0.158382,0.633476
6765,delivery@package-update.info,Exclusive Investment Opportunity - High Return...,Don't miss out on this incredible offer! All p...,Slow,0.231802,0.171002,0.597195
4321,deals@best-offers.xyz,Act Now: Limited Time Offer - 90% Off All Prod...,Your recent order #12345 is delayed. Click her...,Slow,0.230079,0.182318,0.587603
4423,deals@best-offers.xyz,Get Rich Quick Scheme - No Experience Needed\n,Your recent order #12345 is delayed. Click her...,Slow,0.252465,0.199581,0.547955
9187,deals@best-offers.xyz,Get Rich Quick Scheme - No Experience Needed\n,Your recent order #12345 is delayed. Click her...,Slow,0.252465,0.199581,0.547955
5589,deals@best-offers.xyz,Congratulations! You've Won a Free iPhone!\n,Make thousands of dollars from home with our p...,Slow,0.250385,0.207972,0.541643
7850,survey@retailfeedback.com,Photos from the Weekend Trip\n,Thank you for your order #6789. Your items wil...,Slow,0.23262,0.24359,0.52379
9317,deals@best-offers.xyz,New Followers on Your Social Media!\n,Make thousands of dollars from home with our p...,Slow,0.270483,0.206368,0.523148
7825,deals@best-offers.xyz,Verify Your Bank Details Immediately\n,Your recent order #12345 is delayed. Click her...,Slow,0.24352,0.233598,0.522882


In [19]:
# Enron emails (unlabeled)

In [20]:
df_emails_data_random = pd.read_csv('../data/emails_enron_sampled.csv')
df_emails_data_random['email'] = df_emails_data_random['message']
df_emails_data_random

Unnamed: 0,file,message,email
0,shackleton-s/sent/1912.,Message-ID: <21013688.1075844564560.JavaMail.e...,Message-ID: <21013688.1075844564560.JavaMail.e...
1,farmer-d/logistics/1066.,Message-ID: <22688499.1075854130303.JavaMail.e...,Message-ID: <22688499.1075854130303.JavaMail.e...
2,parks-j/deleted_items/202.,Message-ID: <27817771.1075841359502.JavaMail.e...,Message-ID: <27817771.1075841359502.JavaMail.e...
3,stokley-c/chris_stokley/iso/client_rep/41.,Message-ID: <10695160.1075858510449.JavaMail.e...,Message-ID: <10695160.1075858510449.JavaMail.e...
4,germany-c/all_documents/1174.,Message-ID: <27819143.1075853689038.JavaMail.e...,Message-ID: <27819143.1075853689038.JavaMail.e...
...,...,...,...
995,lewis-a/deleted_items/893.,Message-ID: <32544367.1075845225273.JavaMail.e...,Message-ID: <32544367.1075845225273.JavaMail.e...
996,taylor-m/notes_inbox/154.,Message-ID: <23601113.1075859986836.JavaMail.e...,Message-ID: <23601113.1075859986836.JavaMail.e...
997,nemec-g/all_documents/5769.,Message-ID: <18300273.1075842783996.JavaMail.e...,Message-ID: <18300273.1075842783996.JavaMail.e...
998,bass-e/all_documents/642.,Message-ID: <31701504.1075854593577.JavaMail.e...,Message-ID: <31701504.1075854593577.JavaMail.e...


In [21]:
import pandas as pd
import re

text = """Message-ID: <17189699.1075863688308.JavaMail.evans@thyme>
Date: Fri, 14 Jul 2000 06:59:00 -0700 (PDT)
From: phillip.allen@enron.com
To: joyce.teixeira@enron.com
Subject: Re: PRC review - phone calls
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Joyce Teixeira
X-cc: 
X-bcc: 
X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

any morning between 10 and 11:30"""

pattern = (
    r"Message-ID:\s*<(?P<MessageID>[^>]+)>\s*"
    r"Date:\s*(?P<Date>.*?)\s*"
    r"From:\s*(?P<From>.*?)\s*"
    r"To:\s*(?P<To>.*?)\s*"
    r"Subject:\s*(?P<Subject>.*?)\s*"
    r"Mime-Version:\s*(?P<MimeVersion>.*?)\s*"
    r"Content-Type:\s*(?P<ContentType>[^;]+);\s*charset=(?P<Charset>\S+)\s*"
    r"Content-Transfer-Encoding:\s*(?P<Encoding>.*?)\s*"
    r"X-From:\s*(?P<XFrom>.*?)\s*"
    r"X-To:\s*(?P<XTo>.*?)\s*"
    r"X-cc:\s*(?P<Xcc>.*?)\s*"
    r"X-bcc:\s*(?P<Xbcc>.*?)\s*"
    r"X-Folder:\s*(?P<XFolder>.*?)\s*"
    r"X-Origin:\s*(?P<XOrigin>.*?)\s*"
    r"X-FileName:\s*(?P<XFileName>.*?)\s*"
    r"(?P<Body>[\s\S]*)"
)

s = pd.Series([text])
df = s.str.extract(pattern, flags=re.DOTALL)
print(df.T)


                                                           0
MessageID        17189699.1075863688308.JavaMail.evans@thyme
Date                   Fri, 14 Jul 2000 06:59:00 -0700 (PDT)
From                                 phillip.allen@enron.com
To                                  joyce.teixeira@enron.com
Subject                         Re: PRC review - phone calls
MimeVersion                                              1.0
ContentType                                       text/plain
Charset                                             us-ascii
Encoding                                                7bit
XFrom                                        Phillip K Allen
XTo                                           Joyce Teixeira
Xcc                                                         
Xbcc                                                        
XFolder      \Phillip_Allen_Dec2000\Notes Folders\'sent mail
XOrigin                                              Allen-P
XFileName               
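
In the printed output the `XFileName` field is empty, and in the parsed Enron results the body starts with the mailbox file name (for example `sshackle.nsf`). This suggests the final lazy header group matches nothing and the file name falls through into `Body`. As a hedged alternative to the hand-written pattern, the standard-library `email` parser can split headers from the body. The sketch below uses a shortened copy of the sample message; `parse_enron_message` is an illustrative name.

```python
from email import message_from_string

# Shortened copy of the sample message, keeping a subset of its headers.
RAW = (
    "Message-ID: <17189699.1075863688308.JavaMail.evans@thyme>\n"
    "Date: Fri, 14 Jul 2000 06:59:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: joyce.teixeira@enron.com\n"
    "Subject: Re: PRC review - phone calls\n"
    "X-FileName: pallen.nsf\n"
    "\n"
    "any morning between 10 and 11:30"
)

def parse_enron_message(raw):
    """Split one raw message into selected headers and the body text."""
    msg = message_from_string(raw)
    return {
        "From": msg["From"],
        "Subject": msg["Subject"],
        "XFileName": msg["X-FileName"],
        "Body": msg.get_payload(),   # text after the first blank line
    }

parsed = parse_enron_message(RAW)
print(parsed["XFileName"], "|", parsed["Body"])
```

Applied with `Series.apply`, this would keep `Cc:` lines out of the subject and the file name out of the body, at the cost of parsing each message individually.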

In [22]:
df_parsed_emails = df_emails_data_random.message.str.extract(pattern, flags=re.DOTALL)
# df_parsed_emails

In [23]:
prediction_proba = detector.predict_proba(df_emails_data_random['message'])
df_results_enron = pd.DataFrame({
    'email_sender': df_parsed_emails['From'],
    'email_subject': df_parsed_emails['Subject'],
    'email_body': df_parsed_emails['Body'],
    'prediction': detector.predict(df_emails_data_random['message']),
    'score_default': prediction_proba[:,0],
    'score_prioritize': prediction_proba[:,1],
    'score_slow': prediction_proba[:,2],
})
df_results_enron

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
0,sara.shackleton@enron.com,Re: Credit Derivatives,sshackle.nsf\n\nBill: Thanks for the info. ...,Slow,0.321537,0.278949,0.399514
1,pat.clynes@enron.com,Meter #1591 Lamay Gaslift\nCc: daren.farmer@en...,"dfarmer.nsf\n\nAimee,\nPlease check meter #159...",Slow,0.304532,0.273148,0.422320
2,knipe3@msn.com,Re: man night again?,joe parks 6-26-02.pst\n\nGCCA Crawfish and rip...,Slow,0.285324,0.295649,0.419027
3,kalmeida@caiso.com,"Enron 480, 1480 charges","Stokley, Chris (Non-Privileged).pst\n\n <<Keon...",Slow,0.312818,0.280660,0.406522
4,chris.germany@enron.com,Transport Deal,cgerman.nsf\n\nI'm trying to change the Receip...,Slow,0.227007,0.225587,0.547407
...,...,...,...,...,...,...,...
995,alerts@stockselector.com,mPhase Technologies Honored At SUPERCOMM2001; ...,"Lewis, Andrew H..pst\n\nmPhase Technologies Ho...",Slow,0.298455,0.336495,0.365049
996,fishkinc@hotmail.com,,mtaylor.nsf\n\nAttached is my forthcoming arti...,Slow,0.303605,0.294321,0.402075
997,peter.meier@neg.pge.com,RE: Spread Value Calc.\nCc: barry.tycholiz@enr...,gnemec.nsf\n\nWe will review this in the morni...,Slow,0.295192,0.298753,0.406055
998,lwbthemarine@bigplanet.com,U S M C Birthday,ebass.nsf\n\nGuess what! You have just receive...,Slow,0.283570,0.278220,0.438211


In [24]:
df_results_enron.sort_values('score_prioritize', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
434,tk.lohman@enron.com,Confirmations\nCc: kimberly.watson@enron.com,dschool (Non-Privileged).pst\n\n\nPlease find ...,Prioritize,0.207008,0.490606,0.302386
22,,,,Prioritize,0.204983,0.481104,0.313913
151,susan.bailey@enron.com,Intercompany Confirmations,"sbaile2 (Non-Privileged).pst\n\n\nDiane,\n\nAs...",Prioritize,0.221555,0.464158,0.314287
828,alerts@alerts.equityalert.com,Your News Alert for CMGI,andy lewis 6-25-02.PST\n\n\n =09[IMAGE]=09 =09...,Prioritize,0.262575,0.453577,0.283848
882,an1229@hotmail.com,Fwd: 809 area code,"lcampbel.nsf\n\n>From: ""duy lam"" <minhduy@hotm...",Prioritize,0.20567,0.448017,0.346313
919,stephen.douglas@enron.com,Language for the Confirmation\nCc: jordan.mint...,mtaylor.nsf\n\nBelow is slightly revised langu...,Prioritize,0.227704,0.445538,0.326758
602,phil.demoes@enron.com,FW: ENRON CONFIRMATION LETTER - Pool Gas,"dhyvl.nsf\n\nDan,\n\nPlease note the following...",Prioritize,0.251043,0.435178,0.313779
309,arsystem@mailman.enron.com,Your Approval is Overdue: Access Request for\n...,MHAEDIC (Non-Privileged).pst\n\nThis request h...,Prioritize,0.204827,0.428332,0.366841
129,mark.taylor@enron.com,"Re: Consolidated Edison, Inc. Confirmation Let...",mtaylor.nsf\n\nHere is the confirm with my rev...,Prioritize,0.336413,0.425341,0.238246
452,arsystem@mailman.enron.com,Request Submitted: Access Request for sean.rio...,emclaug.nsf\n\nYou have received this email be...,Prioritize,0.212117,0.41274,0.375144


In [25]:
df_results_enron.sort_values('score_default', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
721,douglas@chelanpud.org,RE: Business Practice #11,I have to tell you more than you want to know ...,Default,0.607361,0.187097,0.205542
42,osbareport@ohiobar.org,OSBA Report Online HTML Version Volume 74 Issu...,JHODGE (Non-Privileged).pst\n\n\nGoto Online R...,Default,0.601214,0.178703,0.220082
608,mcunningham@isda.org,Ferrell North America,mhaedic.nsf\n\nMark:\n\nI received an applicat...,Default,0.599242,0.16741,0.233348
970,jgallagher@epsa.org,Comparison Matrix of FERC Filing Requirements,jsteffe (Non-Privileged).pst\n\nAttached is a ...,Default,0.553379,0.219198,0.227423
164,pennfuture@pennfuture.org,PennFuture's E-cubed - What's It Worth To You?,"Dasovich, Jeff (Non-Privileged).pst\n\nPennFut...",Default,0.545594,0.172115,0.282291
150,cmiller@rice.edu,Re: Meeting Nov 8th\nCc: vince.j.kaminski@enro...,"vkamins.nsf\n\nVince,\n\nI look forward to see...",Default,0.541363,0.206107,0.252529
263,legal <.taylor@enron.com>,RE: How are you ?,MTAYLO1 (Non-Privileged).pst\n\nThanks for you...,Default,0.537062,0.165064,0.297874
416,akatz@eei.org,Power Marketing Transactions Using the EEI/NEM...,esager.nsf\n\nMark your calendar to attend the...,Default,0.524673,0.188702,0.286625
683,legal <.taylor@enron.com>,FW: NYPP Description,MTAYLO1 (Non-Privileged).pst\n\nCould one of y...,Default,0.522559,0.217015,0.260426
317,mark.taylor@enron.com,Re: Court Humor,mtaylor.nsf\n\nThanks - I needed a good smile ...,Default,0.506381,0.181318,0.312301


In [26]:
df_results_enron.sort_values('score_slow', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
934,chris.germany@enron.com,Sale to CES,cgerman.nsf\n\nBought from CPA (deal 157219) s...,Slow,0.176265,0.171372,0.652363
720,susan.scott@enron.com,Missing EOL Gas Daily Deals,sscott5.nsf\n\n---------------------- Forwarde...,Slow,0.184006,0.169943,0.646051
859,specialdeals@lists.em5000.com,"Get the credit, you deserve!",kward (Non-Privileged).pst\n\n\n[IMAGE]\t[IMAG...,Slow,0.171948,0.183722,0.64433
101,newsletter@quickinspirations.com,Print FREE Coupons for your Holiday Shopping!,LBLAIR (Non-Privileged).pst\n\n\nQuick Inspira...,Slow,0.205241,0.157101,0.637658
495,pthompson@akllp.com,Today's GE Conference Call\nCc: kay.mann@enron...,kmann.nsf\n\nHas today's call been rescheduled...,Slow,0.174039,0.205649,0.620312
23,susan.scott@enron.com,Sumas deal w/ Larry May,sscott5.nsf\n\nDid any of your traders do a de...,Slow,0.186959,0.193818,0.619224
581,commerce_opspagemaster@dell.com,Dell Computer - Saved Quote Information,"mtaylor.nsf\n\nDear MARK E TAYLOR,\n\nAn E-Quo...",Slow,0.174722,0.214094,0.611183
618,enron.announcements@enron.com,ClickAtHome Pilot Two - IMPORTANT - PC Orderin...,dhyvl.nsf\n\nWe are excited to announce the co...,Slow,0.173587,0.216101,0.610312
616,continentalvacations@lists.coolvacations.com,Continental Airlines Vacations Special Deals,"kward (Non-Privileged).pst\n\nDear Kim,\n\nEnj...",Slow,0.186272,0.211943,0.601785
262,phillip.love@enron.com,transport deal from friday,plove.nsf\n\nI ran a forwards detail for you o...,Slow,0.21353,0.186448,0.600022
