# Prioritize Email Model

## Goal
This notebook demonstrates a basic model that sorts emails into three priority buckets: Slow, Default, and Prioritize.

## Outcome
- Built a logistic regression model to predict email priority.
- The model performs very well on synthetic data (n=500), with near-perfect accuracy, which suggests the synthetic data may be too cookie-cutter to reflect real-world variation.
- The model was applied to two real-world unlabeled email datasets, the Resend spam example and the Enron dataset; the results for each are shown below.

## Methodology

### Datasets
Three datasets were used:
- Synthetic labeled data
  - Used to train the model and evaluate its performance.
  - Generated using the scripts in the utils folder to cover several use cases: promotional emails, MFA verification, time-sensitive emails, and non-urgent emails.
- Unlabeled spam detection data
  - Provided by Resend.
- Unlabeled Enron email data
  - More details [here](https://technocrat.github.io/_book/the-enron-email-corpus.html)
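
The generation scripts in the utils folder are not reproduced in this notebook. Purely as an illustration, a template-based generator for one use case might look like the sketch below. The function name `make_promotional_email`, the sender list, and the templates are all hypothetical; only the `From:` / `Subject:` / body layout and the `Slow` label for promotional mail are taken from the data shown later in the notebook.

```python
import random

# Hypothetical senders and templates; the real utils scripts may differ.
SENDERS = ["offers@shop.example.com", "newsletter@store.example.com"]
SUBJECTS = ["Today only: {pct}% off your order!", "Early access: {pct}% off new arrivals!"]
BODIES = ["Take {pct}% off selected items before the sale ends."]


def make_promotional_email(rng: random.Random) -> dict:
    """Return one synthetic promotional email together with its label."""
    pct = rng.choice([10, 20, 25, 30])
    email = (
        f"From: {rng.choice(SENDERS)}\n"
        f"Subject: {rng.choice(SUBJECTS).format(pct=pct)}\n"
        f"{rng.choice(BODIES).format(pct=pct)}"
    )
    return {"email": email, "label": "Slow"}


rng = random.Random(42)  # fixed seed so the sample is reproducible
rows = [make_promotional_email(rng) for _ in range(3)]
print(rows[0]["email"])
```

Templates this regular are easy for any classifier to separate, which is consistent with the near-perfect scores reported for the synthetic data.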

### Model training
The model was trained on the synthetic data. Although the dataset contains 10k emails, only 5k were used for training, both to avoid overfitting and because the synthetic data is fairly simple. Even with limited training data, accuracy exceeded 99%, which suggests that results on synthetic data are unlikely to generalize to real-world email. Even so, the model provides a useful baseline to improve on as more real data comes in.

### Model evaluation
Three models were considered: logistic regression, random forest, and CatBoost. Given the simple nature of the training data, logistic regression performed well enough and was chosen for its simplicity. We can revisit switching to a more complex model as more features and labeled real-world data become available.
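
The `EmailPrioritizer` class lives in the `email_prioritizer` module, and its internals are not shown in this notebook. The training log later in the notebook reports a grid search over `model__C`, which implies a scikit-learn pipeline with a step named `model`. A minimal sketch of such a setup follows, assuming a TF-IDF vectorizer as the text featurizer and a four-value grid for `C` (both are assumptions), trained on a toy corpus invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only.
emails = [
    "Subject: URGENT payment overdue, respond today",
    "Subject: Security alert, verify your login now",
    "Subject: URGENT server outage needs approval",
    "Subject: 30% off new arrivals this weekend",
    "Subject: Spring sale, free shipping on select items",
    "Subject: Exclusive offer, 25% off everything",
    "Subject: Notes from our last meeting",
    "Subject: Quick question about the Q3 project",
    "Subject: Following up on the website redesign",
]
labels = ["Prioritize"] * 3 + ["Slow"] * 3 + ["Default"] * 3

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                  # assumed featurizer
    ("model", LogisticRegression(max_iter=1000)),  # step name matches `model__C`
])

# Assumed grid; the log only shows that C was tuned.
search = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
    scoring="f1_macro",
)
search.fit(emails, labels)
print(search.best_params_)
print(search.predict(["Subject: URGENT invoice overdue"]))
```

Swapping the `model` step for a random forest or CatBoost classifier, with a matching parameter grid, would give the other two candidates the same treatment.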




In [43]:

import sys
import os 
import pandas as pd

module_path = os.path.abspath(os.path.join(os.getcwd(), "../email_prioritizer"))
sys.path.append(module_path)

from email_prioritizer import EmailPrioritizer
from sklearn.model_selection import train_test_split

In [44]:
list_data_files = [
    '../data/mfa_verification_emails.csv',
    '../data/promotional_emails.csv',
    '../data/urgent_time_sensitive_emails.csv',
    '../data/non_urgent_basic_emails.csv'
]

In [45]:

df_emails_data = pd.concat([pd.read_csv(file) for file in list_data_files], ignore_index=True)
df_emails_data.shape

(10000, 2)

In [46]:
num_holdout = 1000
df_emails_train, df_emails_holdout = train_test_split(df_emails_data, test_size=(num_holdout*1./df_emails_data.shape[0]), random_state=42)
df_emails_train

Unnamed: 0,email,label
4896,From: billing@billing.example.com\nSubject: UR...,Prioritize
4782,From: operations@security.company.com\nSubject...,Prioritize
1496,From: offers@doordash.com\nSubject: Member exc...,Slow
1957,From: sales@target.com\nSubject: Your deal is ...,Slow
9171,From: alex@company.com\nSubject: Following up ...,Default
...,...,...
5734,From: sam@project.io\nSubject: Small update on...,Default
5191,From: avery@project.io\nSubject: Quick questio...,Default
5390,From: morgan@project.io\nSubject: Thanks for y...,Default
860,From: alert@wellsfargo.com\nSubject: Identity ...,Prioritize
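
`train_test_split` also accepts an integer `test_size`, which it interprets as an absolute number of samples, so the fraction does not have to be computed by hand. The sketch below shows this on toy data and additionally passes `stratify` so that each label keeps its share in both splits. Stratification is a suggestion here, not something the cell above does.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the email data: 100 texts with a 50/30/20 label mix.
texts = [f"email {i}" for i in range(100)]
labels = ["Default"] * 50 + ["Slow"] * 30 + ["Prioritize"] * 20

train_x, holdout_x, train_y, holdout_y = train_test_split(
    texts,
    labels,
    test_size=10,      # integer: exactly 10 samples go to the holdout set
    stratify=labels,   # keep label proportions in both splits
    random_state=42,
)
print(len(train_x), len(holdout_x))
print(Counter(holdout_y))
```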


In [None]:
# df_emails_holdout.to_csv('../data/holdout_emails.csv', index=False)

In [None]:
x_col = 'email'
y_col = 'label'
X = df_emails_train[x_col]
y = df_emails_train[y_col]
X_holdout = df_emails_holdout[x_col]
y_holdout = df_emails_holdout[y_col]
test_size=.9494
# test_size=.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
print(f"Training size: {X_train.shape[0]}, Test size: {X_test.shape[0]}")

Training size: 7200, Test size: 1800


In [49]:

import re
pattern = r"From:\s*(?P<From>.+)\nSubject:\s*(?P<Subject>.+)\n(?P<Body>.+)"
df_email_holdout = X_holdout.str.extract(pattern, flags=re.DOTALL)
df_email_holdout


Unnamed: 0,From,Subject,Body
6252,alex@mail.com,Quick question about marketing materials,"Hey Jess, Here’s a quick summary of where we l..."
4684,operations@ops.company.com,URGENT: Payment Overdue: Invoice INV-907918,"Hello Casey Garcia, This is a final reminder: ..."
1731,no-reply@airbnb.com,Early access: 30% off new arrivals!,Take 30% off and enjoy free shipping on select...
4742,finance@alerts.bank.com,URGENT: Security Alert: Respond by 2025-10-24 ...,"Hello Josh Clark, Reference: case #7861466. Co..."
4521,operations@billing.example.com,Escalation: server outage ticket needs your ap...,We need a short confirmation (Yes/No) within 6...
...,...,...,...
3921,newsletter@costco.com,Exclusive online offer: 25% off everything!,Check out our latest deals — 25% off selected ...
6685,alex@teamhub.io,Quick question about the Q3 project,"Hello Jamie, Everything seems to be on track a..."
3194,no-reply@airbnb.com,Spring sale is here! 30% off select items!,Your personalized deal: 30% off selected items...
1941,promotions@nike.com,Your deal is here: 20% off selected items!,Spring into savings with 20% off select products.


In [50]:
detector = EmailPrioritizer(model_type="logistic")
detector.fit(X_train, y_train, cv=3)
detector.evaluate(X_test, y_test)
df_results = pd.DataFrame({
    'email_from': df_email_holdout.From,
    'email_subject': df_email_holdout.Subject,
    'email_body': df_email_holdout.Body,
    'prediction': detector.predict(X_holdout),
    'true_label': y_holdout,
    'correct': detector.predict(X_holdout) == y_holdout,
    'score_default': detector.predict_proba(X_holdout)[:,0],
    'score_prioritize': detector.predict_proba(X_holdout)[:,1],
    'score_slow': detector.predict_proba(X_holdout)[:,2],
})
df_results

Running GridSearchCV for logistic...
Fitting 3 folds for each of 4 candidates, totalling 12 fits





Best cross-val f1 score: 1.000
Best parameters: {'model__C': 0.01}

Evaluation Report:
              precision    recall  f1-score   support

     Default       1.00      1.00      1.00       889
  Prioritize       1.00      1.00      1.00       357
        Slow       1.00      1.00      1.00       554

    accuracy                           1.00      1800
   macro avg       1.00      1.00      1.00      1800
weighted avg       1.00      1.00      1.00      1800

Accuracy: 1.0
F1 Score: 1.0


Unnamed: 0,email_from,email_subject,email_body,prediction,true_label,correct,score_default,score_prioritize,score_slow
6252,alex@mail.com,Quick question about marketing materials,"Hey Jess, Here’s a quick summary of where we l...",Default,Default,True,0.851746,0.060686,0.087567
4684,operations@ops.company.com,URGENT: Payment Overdue: Invoice INV-907918,"Hello Casey Garcia, This is a final reminder: ...",Prioritize,Prioritize,True,0.179599,0.680171,0.140229
1731,no-reply@airbnb.com,Early access: 30% off new arrivals!,Take 30% off and enjoy free shipping on select...,Slow,Slow,True,0.143280,0.187292,0.669428
4742,finance@alerts.bank.com,URGENT: Security Alert: Respond by 2025-10-24 ...,"Hello Josh Clark, Reference: case #7861466. Co...",Prioritize,Prioritize,True,0.150521,0.690974,0.158506
4521,operations@billing.example.com,Escalation: server outage ticket needs your ap...,We need a short confirmation (Yes/No) within 6...,Prioritize,Prioritize,True,0.172810,0.672467,0.154723
...,...,...,...,...,...,...,...,...,...
3921,newsletter@costco.com,Exclusive online offer: 25% off everything!,Check out our latest deals — 25% off selected ...,Slow,Slow,True,0.121835,0.079581,0.798584
6685,alex@teamhub.io,Quick question about the Q3 project,"Hello Jamie, Everything seems to be on track a...",Default,Default,True,0.873999,0.052633,0.073368
3194,no-reply@airbnb.com,Spring sale is here! 30% off select items!,Your personalized deal: 30% off selected items...,Slow,Slow,True,0.101986,0.125277,0.772737
1941,promotions@nike.com,Your deal is here: 20% off selected items!,Spring into savings with 20% off select products.,Slow,Slow,True,0.116098,0.088964,0.794938
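
The score columns above index `predict_proba` by position, which relies on the classifier ordering its classes as Default, Prioritize, Slow. scikit-learn estimators expose that ordering through `classes_`, so the column names can be derived rather than hard-coded. A minimal sketch on a toy pipeline (the TF-IDF step and the toy emails are assumptions for illustration):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration only.
emails = ["urgent payment overdue", "30% off sale today", "notes from our meeting",
          "urgent security alert", "exclusive offer inside", "quick project question"]
labels = ["Prioritize", "Slow", "Default", "Prioritize", "Slow", "Default"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(emails, labels)

new_emails = ["urgent invoice overdue", "sale ends today"]
proba = clf.predict_proba(new_emails)

# Name each probability column after the class it belongs to.
score_cols = [f"score_{name.lower()}" for name in clf.classes_]
df_scores = pd.DataFrame(proba, columns=score_cols)
df_scores["prediction"] = clf.predict(new_emails)
print(df_scores)
```

If `detector.pipeline` is a fitted scikit-learn pipeline, the same `classes_` attribute should be available on it, though that depends on how `EmailPrioritizer` is implemented.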


In [51]:

from pathlib import Path
sys.path.append(str(Path().parent))

In [52]:

import dill

save_model = False
model = detector.pipeline

if save_model:
    # Save one level above the notebooks directory without changing the working directory
    filename = os.path.join('..', 'finalized_model_lr.dill')
    print(os.path.abspath(filename))
    with open(filename, 'wb') as file:
        dill.dump(model, file, recurse=True)

    # Reload the file to confirm the model round-trips correctly
    with open(filename, 'rb') as f:
        model = dill.load(f)
        print(model)

In [None]:
detector_rf = EmailPrioritizer(model_type="random_forest")
detector_rf.fit(X_train, y_train, cv=3)
detector_rf.evaluate(X_test, y_test)
df_results_rf = pd.DataFrame({
    'email_from': df_email_holdout.From,
    'email_subject': df_email_holdout.Subject,
    'email_body': df_email_holdout.Body,
    'prediction': detector_rf.predict(X_holdout),
    'true_label': y_holdout,
    'correct': detector_rf.predict(X_holdout) == y_holdout,
    'score_default': detector_rf.predict_proba(X_holdout)[:,0],
    'score_prioritize': detector_rf.predict_proba(X_holdout)[:,1],
    'score_slow': detector_rf.predict_proba(X_holdout)[:,2],
})
df_results_rf

Running GridSearchCV for random_forest...
Fitting 3 folds for each of 12 candidates, totalling 36 fits


In [None]:
detector_cb = EmailPrioritizer(model_type="catboost")
detector_cb.fit(X_train, y_train, cv=3)
detector_cb.evaluate(X_test, y_test)
df_results_cb = pd.DataFrame({
    'email_from': df_email_holdout.From,
    'email_subject': df_email_holdout.Subject,
    'email_body': df_email_holdout.Body,
    'prediction': detector_cb.predict(X_holdout),
    'true_label': y_holdout,
    'correct': detector_cb.predict(X_holdout) == y_holdout,
    'score_default': detector_cb.predict_proba(X_holdout)[:,0],
    'score_prioritize': detector_cb.predict_proba(X_holdout)[:,1],
    'score_slow': detector_cb.predict_proba(X_holdout)[:,2],
})
df_results_cb

Running GridSearchCV for catboost...
Fitting 3 folds for each of 8 candidates, totalling 24 fits

Best cross-val f1 score: 0.990
Best parameters: {'model__depth': 6, 'model__iterations': 200, 'model__learning_rate': 0.1}

Evaluation Report:
              precision    recall  f1-score   support

     Default       0.99      1.00      0.99      4703
  Prioritize       1.00      0.96      0.98      1869
        Slow       0.99      1.00      1.00      2828

    accuracy                           0.99      9400
   macro avg       0.99      0.99      0.99      9400
weighted avg       0.99      0.99      0.99      9400

Accuracy: 0.9918085106382979
F1 Score: 0.9917433246644096


Unnamed: 0,email_from,email_subject,email_body,prediction,true_label,correct,score_default,score_prioritize,score_slow
6252,alex@mail.com,Quick question about marketing materials,"Hey Jess, Here’s a quick summary of where we l...",Default,Default,True,0.992829,0.001591,0.005581
4684,operations@ops.company.com,URGENT: Payment Overdue: Invoice INV-907918,"Hello Casey Garcia, This is a final reminder: ...",Prioritize,Prioritize,True,0.005053,0.984375,0.010572
1731,no-reply@airbnb.com,Early access: 30% off new arrivals!,Take 30% off and enjoy free shipping on select...,Slow,Slow,True,0.005025,0.005188,0.989787
4742,finance@alerts.bank.com,URGENT: Security Alert: Respond by 2025-10-24 ...,"Hello Josh Clark, Reference: case #7861466. Co...",Prioritize,Prioritize,True,0.005917,0.967844,0.026240
4521,operations@billing.example.com,Escalation: server outage ticket needs your ap...,We need a short confirmation (Yes/No) within 6...,Prioritize,Prioritize,True,0.002246,0.992012,0.005742
...,...,...,...,...,...,...,...,...,...
3787,info@rei.com,Today only: 20% off your order!,Your cart is calling! Get 20% off before it’s ...,Slow,Slow,True,0.006371,0.005387,0.988241
9189,taylor@example.org,Ideas for improving website redesign,"Hey Drew, Hope this helps move things forward ...",Default,Default,True,0.994582,0.001442,0.003977
7825,jamie@project.io,Notes from our last meeting,Feel free to make any changes you think are ne...,Default,Default,True,0.995677,0.001206,0.003118
7539,sam@project.io,Notes from our last meeting,"Hey Jess, No major updates, just sharing where...",Default,Default,True,0.987833,0.005691,0.006476


Logistic regression is used for the rest of the notebook, since it is both the simplest and the best-performing model.

In [None]:
# Resend dataset (unlabeled)


In [None]:
df_emails_spam = pd.read_csv('../data/email_classification_dataset_resend.csv').sample(n=500, random_state=42)
pattern = r"From:\s*(?P<From>.+)\nSubject:\s*(?P<Subject>.+)\n(?P<Body>.+)"
df_emails_spam_parsed = df_emails_spam.email.str.extract(pattern, flags=re.DOTALL)
df_emails_spam_parsed

Unnamed: 0,From,Subject,Body
6252,friend@personalmail.net,Catching Up - How are you?\n,Thank you for your order #6789. Your items wil...
4684,noreply@softwareupdates.com,Meeting Reminder: Project Alpha\n,We value your feedback! Please take a few mome...
1731,friend@personalmail.net,Photos from the Weekend Trip\n,"Hi everyone, I've uploaded the photos from our..."
4742,deals@best-offers.xyz,Verify Your Bank Details Immediately\n,Invest in our revolutionary new platform and e...
4521,team@projectmanagement.com,Meeting Reminder: Project Alpha\n,Thank you for your order #6789. Your items wil...
...,...,...,...
5170,survey@retailfeedback.com,Weekly Newsletter - Latest Updates\n,This is an automated notification regarding an...
7205,survey@retailfeedback.com,Photos from the Weekend Trip\n,"Hey [Friend's Name], it's been a while! How ha..."
2522,info@customerservice.co,Team Stand-up at 10 AM\n,"Good morning, everyone. Just a quick reminder ..."
2215,friend@personalmail.net,Meeting Reminder: Project Alpha\n,Thank you for reaching out regarding [your inq...
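
The trailing `\n` in the Subject column suggests that these emails have a blank line between the subject and the body. With `re.DOTALL`, `.+` also matches newlines, so the greedy Subject group runs on until the last newline in the message. Below is a sketch of a tighter pattern, assuming every email starts with single-line `From:` and `Subject:` headers (the sample message is made up):

```python
import re

# Made-up sample with a blank line after the subject and a two-line body.
sample = (
    "From: friend@personalmail.net\n"
    "Subject: Catching Up - How are you?\n"
    "\n"
    "Hi there,\n"
    "long time no see."
)

# Pattern used in the notebook: with DOTALL, `.+` can cross line breaks.
loose = r"From:\s*(?P<From>.+)\nSubject:\s*(?P<Subject>.+)\n(?P<Body>.+)"

# Tighter variant: headers stop at the end of their line, body takes the rest.
tight = r"From:\s*(?P<From>[^\n]+)\nSubject:\s*(?P<Subject>[^\n]+)\n+(?P<Body>.+)"

loose_match = re.match(loose, sample, flags=re.DOTALL)
tight_match = re.match(tight, sample, flags=re.DOTALL)

print(repr(loose_match.group("Subject")))  # subject swallows part of the body
print(repr(tight_match.group("Subject")))
print(repr(tight_match.group("Body")))
```

The `tight` pattern string can be passed to `Series.str.extract` in the same way as the original one.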


In [None]:

prediction_proba = detector.predict_proba(df_emails_spam['email'])
df_results_resend = pd.DataFrame({
    'email_sender': df_emails_spam_parsed['From'],
    'email_subject': df_emails_spam_parsed['Subject'],
    'email_body': df_emails_spam_parsed['Body'],
    'prediction': detector.predict(df_emails_spam['email']),
    'score_default': prediction_proba[:,0],
    'score_prioritize': prediction_proba[:,1],
    'score_slow': prediction_proba[:,2],
})
df_results_resend

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
6252,friend@personalmail.net,Catching Up - How are you?\n,Thank you for your order #6789. Your items wil...,Slow,0.366309,0.239403,0.394288
4684,noreply@softwareupdates.com,Meeting Reminder: Project Alpha\n,We value your feedback! Please take a few mome...,Default,0.497052,0.242969,0.259979
1731,friend@personalmail.net,Photos from the Weekend Trip\n,"Hi everyone, I've uploaded the photos from our...",Default,0.427011,0.209648,0.363340
4742,deals@best-offers.xyz,Verify Your Bank Details Immediately\n,Invest in our revolutionary new platform and e...,Slow,0.256917,0.277051,0.466032
4521,team@projectmanagement.com,Meeting Reminder: Project Alpha\n,Thank you for your order #6789. Your items wil...,Default,0.434774,0.257608,0.307618
...,...,...,...,...,...,...,...
5170,survey@retailfeedback.com,Weekly Newsletter - Latest Updates\n,This is an automated notification regarding an...,Default,0.442770,0.248885,0.308346
7205,survey@retailfeedback.com,Photos from the Weekend Trip\n,"Hey [Friend's Name], it's been a while! How ha...",Slow,0.348494,0.214277,0.437229
2522,info@customerservice.co,Team Stand-up at 10 AM\n,"Good morning, everyone. Just a quick reminder ...",Default,0.449547,0.196315,0.354137
2215,friend@personalmail.net,Meeting Reminder: Project Alpha\n,Thank you for reaching out regarding [your inq...,Default,0.517278,0.220496,0.262226
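
On this dataset the top score often sits close to the runner-up, unlike the synthetic holdout rows shown earlier, where the winning class scores above 0.65. One way to surface such uncertain predictions is to compute the margin between the two highest probabilities and flag rows below a cutoff. The sketch below uses the first five rows of the table above; the 0.10 cutoff is an arbitrary assumption, not a tuned value.

```python
# Class order follows the score columns: Default, Prioritize, Slow.
CLASSES = ["Default", "Prioritize", "Slow"]

# First five rows of the Resend results (score_default, score_prioritize, score_slow).
scores = [
    [0.366309, 0.239403, 0.394288],
    [0.497052, 0.242969, 0.259979],
    [0.427011, 0.209648, 0.363340],
    [0.256917, 0.277051, 0.466032],
    [0.434774, 0.257608, 0.307618],
]

MARGIN_CUTOFF = 0.10  # assumed threshold for "too close to call"

results = []
for row in scores:
    # Rank classes by probability, highest first.
    ranked = sorted(zip(row, CLASSES), reverse=True)
    (top_p, top_label), (second_p, _) = ranked[0], ranked[1]
    margin = top_p - second_p
    status = "review" if margin < MARGIN_CUTOFF else "ok"
    results.append((top_label, margin, status))
    print(f"{top_label:<10} margin={margin:.3f} {status}")
```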


In [None]:
df_results_resend.sort_values('score_prioritize', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
2018,admin@bank-verify.org,Important Security Alert: Login from New Device\n,We detected suspicious login attempts on your ...,Prioritize,0.265457,0.546229,0.188314
7601,invest@global-finance.biz,URGENT: Your Account Has Been Compromised!\n,"Dear customer, your account has been temporari...",Prioritize,0.236176,0.533238,0.230587
3570,invest@global-finance.biz,URGENT: Your Account Has Been Compromised!\n,Make thousands of dollars from home with our p...,Prioritize,0.229655,0.52066,0.249685
2124,security@alert-system.ru,Important Security Alert: Login from New Device\n,Invest in our revolutionary new platform and e...,Prioritize,0.230409,0.506897,0.262693
970,support@secure-login.com,URGENT: Your Account Has Been Compromised!\n,Urgent security notification! A login from an ...,Prioritize,0.261261,0.489099,0.249639
4955,security@alert-system.ru,Verify Your Bank Details Immediately\n,Make thousands of dollars from home with our p...,Prioritize,0.251408,0.474616,0.273977
8864,support@secure-login.com,Your Package Is Delayed - Action Required\n,Urgent security notification! A login from an ...,Prioritize,0.273383,0.463586,0.263031
2189,noreply@winner-prize.net,Your Package Is Delayed - Action Required\n,"Dear customer, your account has been temporari...",Prioritize,0.277548,0.459603,0.262849
35,admin@bank-verify.org,Important Security Alert: Login from New Device\n,Invest in our revolutionary new platform and e...,Prioritize,0.306511,0.44124,0.252249
6005,support@secure-login.com,Your Package Is Delayed - Action Required\n,Make thousands of dollars from home with our p...,Prioritize,0.261695,0.439271,0.299034


In [None]:
df_results_resend.sort_values('score_default', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
6039,john.doe@example.com,Important: Software Update Notification\n,"Hey [Friend's Name], it's been a while! How ha...",Default,0.684833,0.146587,0.16858
2922,john.doe@example.com,Important: Software Update Notification\n,"Hey [Friend's Name], it's been a while! How ha...",Default,0.684833,0.146587,0.16858
483,john.doe@example.com,Meeting Reminder: Project Alpha\n,"Good morning, everyone. Just a quick reminder ...",Default,0.663769,0.168133,0.168098
8080,john.doe@example.com,Weekly Newsletter - Latest Updates\n,"Good morning, everyone. Just a quick reminder ...",Default,0.659828,0.169611,0.170561
5309,john.doe@example.com,Important: Software Update Notification\n,"Hi team, just a reminder about our Project Alp...",Default,0.639329,0.176554,0.184117
735,john.doe@example.com,Weekly Newsletter - Latest Updates\n,"Hi team, just a reminder about our Project Alp...",Default,0.639059,0.176681,0.18426
6149,john.doe@example.com,Meeting Reminder: Project Alpha\n,Here's your weekly dose of news and updates fr...,Default,0.630249,0.17492,0.194832
9445,john.doe@example.com,Meeting Reminder: Project Alpha\n,We value your feedback! Please take a few mome...,Default,0.628127,0.175045,0.196828
7618,john.doe@example.com,Weekly Newsletter - Latest Updates\n,Here's your weekly dose of news and updates fr...,Default,0.6261,0.176342,0.197558
4630,john.doe@example.com,Weekly Newsletter - Latest Updates\n,Here's your weekly dose of news and updates fr...,Default,0.6261,0.176342,0.197558


In [None]:
df_results_resend.sort_values('score_slow', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
6765,delivery@package-update.info,Exclusive Investment Opportunity - High Return...,Don't miss out on this incredible offer! All p...,Slow,0.196115,0.145865,0.65802
582,deals@best-offers.xyz,Exclusive Investment Opportunity - High Return...,Boost your social media presence! Get thousand...,Slow,0.222627,0.166714,0.610658
487,deals@best-offers.xyz,Unclaimed Funds Await You\n,Don't miss out on this incredible offer! All p...,Slow,0.23266,0.168418,0.598922
4321,deals@best-offers.xyz,Act Now: Limited Time Offer - 90% Off All Prod...,Your recent order #12345 is delayed. Click her...,Slow,0.239906,0.188036,0.572058
5196,delivery@package-update.info,Exclusive Investment Opportunity - High Return...,Invest in our revolutionary new platform and e...,Slow,0.241634,0.201146,0.55722
5766,delivery@package-update.info,Act Now: Limited Time Offer - 90% Off All Prod...,Boost your social media presence! Get thousand...,Slow,0.256458,0.194431,0.549112
8328,info@unclaimed-funds.co,Act Now: Limited Time Offer - 90% Off All Prod...,Boost your social media presence! Get thousand...,Slow,0.256458,0.194431,0.549112
6163,money@easy-cash.top,Act Now: Limited Time Offer - 90% Off All Prod...,Don't miss out on this incredible offer! All p...,Slow,0.268891,0.196725,0.534384
7850,survey@retailfeedback.com,Photos from the Weekend Trip\n,Thank you for your order #6789. Your items wil...,Slow,0.247284,0.225789,0.526926
8846,support@secure-login.com,Act Now: Limited Time Offer - 90% Off All Prod...,Don't miss out on this incredible offer! All p...,Slow,0.242388,0.236837,0.520775
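
The three cells above repeat the same sort for each score column. A compact alternative is to loop over the columns and use `DataFrame.nlargest`. The sketch below runs on a small stand-in frame built from a few rows of the Resend results; in the notebook, `df_results_resend` would take its place.

```python
import pandas as pd

# Stand-in for df_results_resend, using a few rows from the tables above.
df_demo = pd.DataFrame({
    "email_sender": ["admin@bank-verify.org", "john.doe@example.com",
                     "deals@best-offers.xyz", "friend@personalmail.net"],
    "prediction": ["Prioritize", "Default", "Slow", "Slow"],
    "score_default": [0.265457, 0.684833, 0.222627, 0.366309],
    "score_prioritize": [0.546229, 0.146587, 0.166714, 0.239403],
    "score_slow": [0.188314, 0.168580, 0.610658, 0.394288],
})

top_by_class = {}
for col in ["score_prioritize", "score_default", "score_slow"]:
    # nlargest(n, col) returns the n rows with the highest value in col.
    top_by_class[col] = df_demo.nlargest(2, col)
    print(f"Top rows by {col}:")
    print(top_by_class[col][["email_sender", "prediction", col]])
```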


In [None]:
# Enron emails (unlabeled)

In [None]:
df_emails_data_random = pd.read_csv('../data/emails_enron_sampled.csv')
df_emails_data_random['email'] = df_emails_data_random['message']
df_emails_data_random

Unnamed: 0,file,message,email
0,shackleton-s/sent/1912.,Message-ID: <21013688.1075844564560.JavaMail.e...,Message-ID: <21013688.1075844564560.JavaMail.e...
1,farmer-d/logistics/1066.,Message-ID: <22688499.1075854130303.JavaMail.e...,Message-ID: <22688499.1075854130303.JavaMail.e...
2,parks-j/deleted_items/202.,Message-ID: <27817771.1075841359502.JavaMail.e...,Message-ID: <27817771.1075841359502.JavaMail.e...
3,stokley-c/chris_stokley/iso/client_rep/41.,Message-ID: <10695160.1075858510449.JavaMail.e...,Message-ID: <10695160.1075858510449.JavaMail.e...
4,germany-c/all_documents/1174.,Message-ID: <27819143.1075853689038.JavaMail.e...,Message-ID: <27819143.1075853689038.JavaMail.e...
...,...,...,...
995,lewis-a/deleted_items/893.,Message-ID: <32544367.1075845225273.JavaMail.e...,Message-ID: <32544367.1075845225273.JavaMail.e...
996,taylor-m/notes_inbox/154.,Message-ID: <23601113.1075859986836.JavaMail.e...,Message-ID: <23601113.1075859986836.JavaMail.e...
997,nemec-g/all_documents/5769.,Message-ID: <18300273.1075842783996.JavaMail.e...,Message-ID: <18300273.1075842783996.JavaMail.e...
998,bass-e/all_documents/642.,Message-ID: <31701504.1075854593577.JavaMail.e...,Message-ID: <31701504.1075854593577.JavaMail.e...


In [None]:
import pandas as pd
import re

text = """Message-ID: <17189699.1075863688308.JavaMail.evans@thyme>
Date: Fri, 14 Jul 2000 06:59:00 -0700 (PDT)
From: phillip.allen@enron.com
To: joyce.teixeira@enron.com
Subject: Re: PRC review - phone calls
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Joyce Teixeira
X-cc: 
X-bcc: 
X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

any morning between 10 and 11:30"""

pattern = (
    r"Message-ID:\s*<(?P<MessageID>[^>]+)>\s*"
    r"Date:\s*(?P<Date>.*?)\s*"
    r"From:\s*(?P<From>.*?)\s*"
    r"To:\s*(?P<To>.*?)\s*"
    r"Subject:\s*(?P<Subject>.*?)\s*"
    r"Mime-Version:\s*(?P<MimeVersion>.*?)\s*"
    r"Content-Type:\s*(?P<ContentType>[^;]+);\s*charset=(?P<Charset>\S+)\s*"
    r"Content-Transfer-Encoding:\s*(?P<Encoding>.*?)\s*"
    r"X-From:\s*(?P<XFrom>.*?)\s*"
    r"X-To:\s*(?P<XTo>.*?)\s*"
    r"X-cc:\s*(?P<Xcc>.*?)\s*"
    r"X-bcc:\s*(?P<Xbcc>.*?)\s*"
    r"X-Folder:\s*(?P<XFolder>.*?)\s*"
    r"X-Origin:\s*(?P<XOrigin>.*?)\s*"
    r"X-FileName:\s*(?P<XFileName>.*?)\s*"
    r"(?P<Body>[\s\S]*)"
)

s = pd.Series([text])
df = s.str.extract(pattern, flags=re.DOTALL)
print(df.T)


                                                           0
MessageID        17189699.1075863688308.JavaMail.evans@thyme
Date                   Fri, 14 Jul 2000 06:59:00 -0700 (PDT)
From                                 phillip.allen@enron.com
To                                  joyce.teixeira@enron.com
Subject                         Re: PRC review - phone calls
MimeVersion                                              1.0
ContentType                                       text/plain
Charset                                             us-ascii
Encoding                                                7bit
XFrom                                        Phillip K Allen
XTo                                           Joyce Teixeira
Xcc                                                         
Xbcc                                                        
XFolder      \Phillip_Allen_Dec2000\Notes Folders\'sent mail
XOrigin                                              Allen-P
XFileName               
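
Note that the lazy `XFileName` group comes out empty: nothing after `.*?` forces it to consume text, so the file name ends up at the start of `Body` instead. The pattern also has no slot for optional headers such as `Cc`. As an alternative to a hand-written pattern, Python's standard-library `email` parser handles the header block and the blank-line separator directly. This is a different technique from the regex used in this notebook, shown as a sketch on a shortened copy of the same message:

```python
from email import message_from_string

# Shortened copy of the sample message used above.
raw = (
    "Message-ID: <17189699.1075863688308.JavaMail.evans@thyme>\n"
    "Date: Fri, 14 Jul 2000 06:59:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: joyce.teixeira@enron.com\n"
    "Subject: Re: PRC review - phone calls\n"
    "X-Origin: Allen-P\n"
    "X-FileName: pallen.nsf\n"
    "\n"
    "any morning between 10 and 11:30"
)

msg = message_from_string(raw)

parsed = {
    "From": msg["From"],
    "Subject": msg["Subject"],
    "Cc": msg["Cc"],                  # None when the header is absent
    "XFileName": msg["X-FileName"],
    "Body": msg.get_payload(),        # everything after the blank line
}
for key, value in parsed.items():
    print(f"{key}: {value!r}")
```

A function wrapping these lines could be applied to the `message` column with `Series.apply` to build the parsed frame row by row.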

In [None]:
df_parsed_emails = df_emails_data_random.message.str.extract(pattern, flags=re.DOTALL)
# df_parsed_emails

In [None]:
prediction_proba = detector.predict_proba(df_emails_data_random['message'])
df_results_enron = pd.DataFrame({
    'email_sender': df_parsed_emails['From'],
    'email_subject': df_parsed_emails['Subject'],
    'email_body': df_parsed_emails['Body'],
    'prediction': detector.predict(df_emails_data_random['message']),
    'score_default': prediction_proba[:,0],
    'score_prioritize': prediction_proba[:,1],
    'score_slow': prediction_proba[:,2],
})
df_results_enron

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
0,sara.shackleton@enron.com,Re: Credit Derivatives,sshackle.nsf\n\nBill: Thanks for the info. ...,Slow,0.337540,0.273469,0.388991
1,pat.clynes@enron.com,Meter #1591 Lamay Gaslift\nCc: daren.farmer@en...,"dfarmer.nsf\n\nAimee,\nPlease check meter #159...",Slow,0.333946,0.252029,0.414025
2,knipe3@msn.com,Re: man night again?,joe parks 6-26-02.pst\n\nGCCA Crawfish and rip...,Slow,0.303787,0.289281,0.406932
3,kalmeida@caiso.com,"Enron 480, 1480 charges","Stokley, Chris (Non-Privileged).pst\n\n <<Keon...",Slow,0.338272,0.260884,0.400844
4,chris.germany@enron.com,Transport Deal,cgerman.nsf\n\nI'm trying to change the Receip...,Slow,0.242966,0.215996,0.541038
...,...,...,...,...,...,...,...
995,alerts@stockselector.com,mPhase Technologies Honored At SUPERCOMM2001; ...,"Lewis, Andrew H..pst\n\nmPhase Technologies Ho...",Slow,0.303452,0.331312,0.365236
996,fishkinc@hotmail.com,,mtaylor.nsf\n\nAttached is my forthcoming arti...,Slow,0.322265,0.279329,0.398406
997,peter.meier@neg.pge.com,RE: Spread Value Calc.\nCc: barry.tycholiz@enr...,gnemec.nsf\n\nWe will review this in the morni...,Slow,0.324932,0.283245,0.391824
998,lwbthemarine@bigplanet.com,U S M C Birthday,ebass.nsf\n\nGuess what! You have just receive...,Slow,0.317769,0.257751,0.424480


In [None]:
df_results_enron.sort_values('score_prioritize', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
434,tk.lohman@enron.com,Confirmations\nCc: kimberly.watson@enron.com,dschool (Non-Privileged).pst\n\n\nPlease find ...,Prioritize,0.235647,0.45952,0.304833
882,an1229@hotmail.com,Fwd: 809 area code,"lcampbel.nsf\n\n>From: ""duy lam"" <minhduy@hotm...",Prioritize,0.224801,0.452829,0.32237
828,alerts@alerts.equityalert.com,Your News Alert for CMGI,andy lewis 6-25-02.PST\n\n\n =09[IMAGE]=09 =09...,Prioritize,0.280332,0.436882,0.282786
22,,,,Prioritize,0.233544,0.429749,0.336707
151,susan.bailey@enron.com,Intercompany Confirmations,"sbaile2 (Non-Privileged).pst\n\n\nDiane,\n\nAs...",Prioritize,0.259439,0.416841,0.323719
919,stephen.douglas@enron.com,Language for the Confirmation\nCc: jordan.mint...,mtaylor.nsf\n\nBelow is slightly revised langu...,Prioritize,0.259002,0.407711,0.333287
602,phil.demoes@enron.com,FW: ENRON CONFIRMATION LETTER - Pool Gas,"dhyvl.nsf\n\nDan,\n\nPlease note the following...",Prioritize,0.285037,0.40685,0.308113
305,d..steffes@enron.com,Bond Requirement and Other Financial Guarantee...,JSTEFFE (Non-Privileged).pst\n\nDenise --\n\nD...,Prioritize,0.273079,0.40565,0.321271
129,mark.taylor@enron.com,"Re: Consolidated Edison, Inc. Confirmation Let...",mtaylor.nsf\n\nHere is the confirm with my rev...,Prioritize,0.356127,0.391044,0.252828
368,michelle.lokay@enron.com,Re: URGENT REQUEST FROM STEVE,mlokay.nsf\n\nSee below....\n\n\n\n\n\nAudrey ...,Prioritize,0.278295,0.382332,0.339373


In [None]:
df_results_enron.sort_values('score_default', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
721,douglas@chelanpud.org,RE: Business Practice #11,I have to tell you more than you want to know ...,Default,0.585379,0.190427,0.224193
42,osbareport@ohiobar.org,OSBA Report Online HTML Version Volume 74 Issu...,JHODGE (Non-Privileged).pst\n\n\nGoto Online R...,Default,0.572763,0.181524,0.245714
150,cmiller@rice.edu,Re: Meeting Nov 8th\nCc: vince.j.kaminski@enro...,"vkamins.nsf\n\nVince,\n\nI look forward to see...",Default,0.571223,0.194028,0.23475
608,mcunningham@isda.org,Ferrell North America,mhaedic.nsf\n\nMark:\n\nI received an applicat...,Default,0.564051,0.180406,0.255542
263,legal <.taylor@enron.com>,RE: How are you ?,MTAYLO1 (Non-Privileged).pst\n\nThanks for you...,Default,0.538363,0.163473,0.298164
683,legal <.taylor@enron.com>,FW: NYPP Description,MTAYLO1 (Non-Privileged).pst\n\nCould one of y...,Default,0.530707,0.21201,0.257283
394,mark.taylor@enron.com,"Alcoa, Inc.","mtaylor.nsf\n\nI talked to Kathy, gave her som...",Default,0.524313,0.191247,0.28444
317,mark.taylor@enron.com,Re: Court Humor,mtaylor.nsf\n\nThanks - I needed a good smile ...,Default,0.520595,0.180076,0.299329
959,mark.taylor@enron.com,Re: law clerk interviews,mtaylor.nsf\n\n1. Jason Rose\n2. Jason appeare...,Default,0.519535,0.2183,0.262165
164,pennfuture@pennfuture.org,PennFuture's E-cubed - What's It Worth To You?,"Dasovich, Jeff (Non-Privileged).pst\n\nPennFut...",Default,0.51845,0.178987,0.302562


In [None]:
df_results_enron.sort_values('score_slow', ascending=False).head(10)

Unnamed: 0,email_sender,email_subject,email_body,prediction,score_default,score_prioritize,score_slow
720,susan.scott@enron.com,Missing EOL Gas Daily Deals,sscott5.nsf\n\n---------------------- Forwarde...,Slow,0.202199,0.164616,0.633185
101,newsletter@quickinspirations.com,Print FREE Coupons for your Holiday Shopping!,LBLAIR (Non-Privileged).pst\n\n\nQuick Inspira...,Slow,0.214559,0.155829,0.629611
616,continentalvacations@lists.coolvacations.com,Continental Airlines Vacations Special Deals,"kward (Non-Privileged).pst\n\nDear Kim,\n\nEnj...",Slow,0.189137,0.186243,0.624619
581,commerce_opspagemaster@dell.com,Dell Computer - Saved Quote Information,"mtaylor.nsf\n\nDear MARK E TAYLOR,\n\nAn E-Quo...",Slow,0.182238,0.193424,0.624338
934,chris.germany@enron.com,Sale to CES,cgerman.nsf\n\nBought from CPA (deal 157219) s...,Slow,0.205929,0.17356,0.620511
859,specialdeals@lists.em5000.com,"Get the credit, you deserve!",kward (Non-Privileged).pst\n\n\n[IMAGE]\t[IMAG...,Slow,0.200798,0.187088,0.612114
23,susan.scott@enron.com,Sumas deal w/ Larry May,sscott5.nsf\n\nDid any of your traders do a de...,Slow,0.202678,0.189221,0.608101
489,gift@amazon.com,Free Shipping Ends December 4--Shop Today,JDASOVIC (Non-Privileged).pst\n\nDear Amazon.c...,Slow,0.223172,0.183203,0.593625
124,auto-confirm@amazon.com,Your Order with Amazon.com (#105-4008195-1643924),emclaug.nsf\n\nAmazon.com logo\tyour account\n...,Slow,0.219946,0.188881,0.591173
497,chris.germany@enron.com,Deal 628466,cgerman.nsf\n\nI changed the pipeline and the ...,Slow,0.221285,0.188668,0.590046
