## Import and have a Look

In [1]:
import numpy as np
import pandas as pd

import os, gc, re

df = pd.read_csv('/kaggle/input/enron-email-dataset/emails.csv')
df.head(5)

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


Wow, only two columns! What's inside the message? Let's look at an example:

In [2]:
print(df.iloc[22,1])

Message-ID: <26575732.1075855687756.JavaMail.evans@thyme>
Date: Mon, 2 Oct 2000 02:19:00 -0700 (PDT)
From: phillip.allen@enron.com
To: bs_stone@yahoo.com
Subject: Re: Original Sept check/closing
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: "BS Stone" <bs_stone@yahoo.com> @ ENRON
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

Brenda,

 Please use the second check as my October payment.  I have my copy of the 
original deal.  Do you want me to fax this to you?

Phillip


## Information Part

Let's separate the "info" and "content" parts, and deal with the information part first.

In [3]:
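def info_part(i):
    """Return the information (header) part: everything before the first blank line."""
    return i.split('\n\n', 1)[0]
def content_part(i):
    """Return the content (body) part: everything after the first blank line."""
    return i.split('\n\n', 1)[1]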
df['pre_info'] = df.message.map(info_part)
df['content'] = df.message.map(content_part)
df['test_true'] = True

words2split = ['Message-ID: ', 'Date: ', 'From: ', 'To: ', 'Subject: ', 'Cc: ', 'Mime-Version: ', 'Content-Type: ',
               'Content-Transfer-Encoding: ', 'Bcc: ', 'X-From: ', 'X-To: ', 'X-cc: ', 'X-bcc: ', 'X-Folder: ', 'X-Origin: ',
               'X-FileName: ']
features_naming = [i[:-2] for i in words2split]
split_condition = '|'.join(words2split)
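
The split on the first blank line works because an email's header block ends at the first empty line. As a cross-check, here is a minimal sketch that applies the same split to a shortened copy of the message printed above and compares it with Python's standard `email` parser. The shortened `raw` string is an illustrative assumption, and the notebook itself does not use the `email` module.

```python
from email import message_from_string

# Shortened copy of the example message (some headers omitted for brevity)
raw = (
    "Message-ID: <26575732.1075855687756.JavaMail.evans@thyme>\n"
    "Date: Mon, 2 Oct 2000 02:19:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: bs_stone@yahoo.com\n"
    "Subject: Re: Original Sept check/closing\n"
    "X-FileName: pallen.nsf\n"
    "\n"
    "Brenda,\n\n Please use the second check as my October payment.\n"
)

# Same approach as info_part / content_part: cut at the first blank line
header_block, body = raw.split("\n\n", 1)

# Standard-library parser for comparison; absent headers come back as None
msg = message_from_string(raw)
parsed = {name: msg[name] for name in ("From", "To", "Subject", "Cc")}

print(header_block.splitlines()[-1])  # last header line
print(parsed)
print(msg.get_payload()[:7])          # start of the body
```

For well-formed messages the two approaches should agree on where the body starts. The parser also reports a missing header such as `Cc` as `None`, so no placeholder lines are needed.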

In [4]:
# Some subject lines contain header-like text (e.g. ' Date: ') that confuses the string split, so I rewrite those occurrences
def duplicated_info(i):
    return i.replace(' Date: ', ' Date- ').replace(' Subject: ', ' Subject2: ').replace(' To: ',
                    ' To- ').replace(' (Subject: ', ' (Subject- ')
df['pre_info'] = df['pre_info'].map(duplicated_info)

# Count how many pieces each header block splits into (17 headers give 18 pieces)
def num_part(i):
    return len(re.split(split_condition, i))
df['num_info'] = df['pre_info'].map(num_part)

# Around 20k emails have no 'To: ' header, so I add an empty one before 'Subject: '
def add_to(i):
    return i.replace('\nSubject: ', '\nTo: \nSubject: ')
temp_condition = (df['num_info'] == 17) | (df['num_info'] == 15)
df.loc[temp_condition, 'pre_info'] = df.loc[temp_condition, 'pre_info'].map(add_to)


# Handle missing 'Cc: ' and 'Bcc: ' headers in a similar way
temp_condition = (df['num_info'] == 16) | (df['num_info'] == 15)
def add_bcc(i):
    return i.replace('\nX-From: ', '\nBcc: \nX-From: ')
df.loc[temp_condition, 'pre_info'] = df.loc[temp_condition, 'pre_info'].map(add_bcc)
def add_cc(i):
    return i.replace('\nMime-Version: ', '\nCc: \nMime-Version: ')
df.loc[temp_condition, 'pre_info'] = df.loc[temp_condition, 'pre_info'].map(add_cc)
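
The counts 15, 16 and 17 used above follow from how `re.split` behaves: splitting on 17 header names yields 18 pieces (one leading empty piece plus 17 values), and each missing header lowers the count by one. A small sketch with synthetic header blocks makes this concrete. The placeholder value `x` and the helper `make_headers` are hypothetical and exist only for this illustration.

```python
import re

words2split = ['Message-ID: ', 'Date: ', 'From: ', 'To: ', 'Subject: ', 'Cc: ',
               'Mime-Version: ', 'Content-Type: ', 'Content-Transfer-Encoding: ',
               'Bcc: ', 'X-From: ', 'X-To: ', 'X-cc: ', 'X-bcc: ', 'X-Folder: ',
               'X-Origin: ', 'X-FileName: ']
split_condition = '|'.join(words2split)

def num_part(text):
    """Number of pieces produced by splitting on the header names."""
    return len(re.split(split_condition, text))

def make_headers(missing=()):
    """Hypothetical helper: build a header block with dummy values."""
    return '\n'.join(w + 'x' for w in words2split if w not in missing)

full = make_headers()
no_to = make_headers(missing=['To: '])
no_cc_bcc = make_headers(missing=['Cc: ', 'Bcc: '])
no_to_cc_bcc = make_headers(missing=['To: ', 'Cc: ', 'Bcc: '])

counts = [num_part(h) for h in (full, no_to, no_cc_bcc, no_to_cc_bcc)]
print(counts)  # [18, 17, 16, 15]

# Inserting an empty 'To: ' line restores one piece, as add_to does
repaired = no_to.replace('\nSubject: ', '\nTo: \nSubject: ')
print(num_part(repaired))  # 18
```

This is why 17 and 15 are treated as "missing `To:`" and 16 and 15 as "missing `Cc:` and `Bcc:`".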

Now let's see how many badly formatted emails are left:

In [5]:
df['num_info'] = df['pre_info'].map(num_part)
df['num_info'].value_counts()

18    517398
5          2
6          1
Name: num_info, dtype: int64

Only 3 of them are left.
I simply keep them in a separate frame so they can be printed and inspected, then remove them from the dataset.

In [6]:
df_remove = df.loc[df['num_info'] != 18].copy()
df = df.loc[df['num_info'] == 18].copy()

In [7]:
# info_split and info_split_last read the loop variable feature_idx set in the for loop below
def info_split(i):
    # Take the value of header number feature_idx.
    # NOTE: [:-2] drops the trailing '\n' plus one more character,
    # so non-empty values lose their last character (e.g. '7bit' becomes '7bi').
    return re.split(split_condition, i)[feature_idx+1][:-2]
def info_split_last(i):
    # The last header (X-FileName) has no trailing '\n', so nothing is trimmed
    return re.split(split_condition, i)[feature_idx+1]
for feature_idx in range(len(words2split)):
    if feature_idx != len(words2split) - 1:
        df[features_naming[feature_idx]] = df['pre_info'].map(info_split)
    else:
        df[features_naming[feature_idx]] = df['pre_info'].map(info_split_last)

Let's check one category to see whether the split worked:

In [8]:
df['Content-Transfer-Encoding'].value_counts()

7bi                            494994
quoted-printabl                 22399
base6                               4
text/plain; charset=us-asci         1
Name: Content-Transfer-Encoding, dtype: int64
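
The labels above each appear to be missing their final character (`7bi`, `quoted-printabl`, `base6`), which is consistent with the `[:-2]` slice in `info_split` removing one character more than the trailing newline. A possible alternative, sketched here and not part of the original notebook, captures the header names while splitting and strips only `\n`. The function name `split_headers` and the shortened header list are hypothetical.

```python
import re

# Shortened header list for illustration; the notebook uses all 17 names
words2split = ['Mime-Version: ', 'Content-Type: ',
               'Content-Transfer-Encoding: ', 'X-FileName: ']
# The capturing group makes re.split return the matched header names as well
pattern = re.compile('(' + '|'.join(re.escape(w) for w in words2split) + ')')

def split_headers(pre_info):
    """Map each header name to its value, trimming only trailing newlines."""
    pieces = pattern.split(pre_info)
    names, values = pieces[1::2], pieces[2::2]
    return {name[:-2]: value.rstrip('\n') for name, value in zip(names, values)}

sample = ('Mime-Version: 1.0\n'
          'Content-Type: text/plain; charset=us-ascii\n'
          'Content-Transfer-Encoding: 7bit\n'
          'X-FileName: pallen.nsf')

headers = split_headers(sample)
print(headers['Content-Transfer-Encoding'])  # 7bit
print('7bit\n'[:-2])                         # 7bi, the truncation seen above
```

If values were extracted this way, any later comparison against a truncated string such as `'text/plain; charset=us-asci'` would need the full value instead.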

One row is still not quite right: its Content-Type value has ended up in this column. I will just remove it too.

In [9]:
df_remove2 = df.loc[df['Content-Transfer-Encoding'] == 'text/plain; charset=us-asci']
df = df.loc[df['Content-Transfer-Encoding'] != 'text/plain; charset=us-asci']

Have a look at the discarded emails:

In [10]:
# print(df_remove.iloc[0,1])
# print(df_remove2.iloc[0,1])

## Content Part

Many emails contain material that is not plain English text, such as attached files and "Forwarded" messages. I discovered that many of these parts are separated by "-------------", so I cut the content at the first such separator, discard the rest, and add indicator columns.

In [11]:
df.loc[df["content"].str.contains("-------------"), "content"]

9         ---------------------- Forwarded by Phillip K ...
12        ---------------------- Forwarded by Phillip K ...
13        ---------------------- Forwarded by Phillip K ...
16        ---------------------- Forwarded by Phillip K ...
18        ---------------------- Forwarded by Phillip K ...
                                ...                        
517175    \n\n -----Original Message-----\nFrom: \tKeoha...
517197    We can have you sit down with John Disturnal, ...
517205    \n\n -----Original Message-----\nFrom: \tkfrog...
517313    \n\n -----Original Message-----\nFrom: \t"Trav...
517321    \n\n -----Original Message-----\nFrom: \t"J&J ...
Name: content, Length: 97360, dtype: object

In [12]:
def split_other_content(i):
    """split other forms of contents out"""
    return i.split('-------------', 1)[0]
df["has_other_content"] = df["content"].str.contains("-------------")
df["if_forwarded"] = df["content"].str.contains("------------- Forwarded")
df['content'] = df.content.map(split_other_content)
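
To see what this cut does to individual messages, here is a small pure-Python sketch of the same three operations. The sample strings are made up for illustration. Two behaviours are worth noticing: a message that opens with a forwarding banner is reduced to an empty string, and the shorter `-----Original Message-----` marker does not count as a separator on its own.

```python
SEPARATOR = '-------------'  # same separator string as in the cell above

def split_other_content(text):
    """Keep only the text before the first separator."""
    return text.split(SEPARATOR, 1)[0]

# Made-up sample messages
samples = [
    "Let's shoot for Tuesday at 11:45.",
    '-' * 22 + ' Forwarded by Phillip K Allen ----\nSee below.',
    'Thanks.\n\n -----Original Message-----\nFrom: someone\nHello',
]

results = []
for text in samples:
    has_other_content = SEPARATOR in text
    if_forwarded = (SEPARATOR + ' Forwarded') in text
    results.append((has_other_content, if_forwarded, split_other_content(text)))

for row in results:
    print(row)
```

The second sample is flagged as forwarded and keeps no text at all, while the third passes through unchanged.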

I know this is not a perfect method, but it is good enough for now.  
Finally, we drop the auxiliary columns and export the result:

In [13]:
df = df.drop(['pre_info','test_true', 'num_info'], axis = 1).set_index("file")
df.to_csv("emails_cleaned.csv")

Note that the content part can be cleaned further, but that will take extra effort. Good luck!

In [14]:
df.head(5)

file,message,content,Message-ID,Date,From,To,Subject,Cc,Mime-Version,Content-Type,...,Bcc,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,has_other_content,if_forwarded
allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,Here is our forecast\n\n,<18782981.1075855378110.JavaMail.evans@thyme,"Mon, 14 May 2001 16:39:00 -0700 (PDT",phillip.allen@enron.co,tim.belden@enron.co,,,1.0,text/plain; charset=us-asci,...,,Phillip K Alle,Tim Belden <Tim Belden/Enron@EnronXGate,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-,pallen (Non-Privileged).pst,False,False
allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,Traveling to have a business meeting takes the...,<15464986.1075855378456.JavaMail.evans@thyme,"Fri, 4 May 2001 13:51:00 -0700 (PDT",phillip.allen@enron.co,john.lavorato@enron.co,Re,,1.0,text/plain; charset=us-asci,...,,Phillip K Alle,John J Lavorato <John J Lavorato/ENRON@enronXg...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-,pallen (Non-Privileged).pst,False,False
allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,test successful. way to go!!!,<24216240.1075855687451.JavaMail.evans@thyme,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT",phillip.allen@enron.co,leah.arsdall@enron.co,Re: tes,,1.0,text/plain; charset=us-asci,...,,Phillip K Alle,Leah Van Arsdal,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mai,Allen-,pallen.nsf,False,False
allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,"Randy,\n\n Can you send me a schedule of the s...",<13505866.1075863688222.JavaMail.evans@thyme,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT",phillip.allen@enron.co,randall.gay@enron.co,,,1.0,text/plain; charset=us-asci,...,,Phillip K Alle,Randall L Ga,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mai,Allen-,pallen.nsf,False,False
allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,Let's shoot for Tuesday at 11:45.,<30922949.1075863688243.JavaMail.evans@thyme,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT",phillip.allen@enron.co,greg.piper@enron.co,Re: Hell,,1.0,text/plain; charset=us-asci,...,,Phillip K Alle,Greg Pipe,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mai,Allen-,pallen.nsf,False,False

