# Handling and analysis of the Enron Email Dataset - Part 1


------------------------------

## The class definitions


* Class `EnronEmailParser`: parser for the emails included in the [Enron Email Dataset](http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz).  
_Note: This implementation treats To, Cc and Bcc recipients as a single recipient type._
* Class `EnronEmailDataset`: data handler for the Enron Email Dataset.  
_Note 1: It relies on the `EnronEmailParser` class to do the actual email parsing._  
_Note 2: It stores the data in pandas dataframes._
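The class source is not reproduced in this notebook. As a rough idea of what the parsing step involves, the sketch below uses Python's standard `email` package to pull the sender, timestamp, subject and a single merged recipient list out of one raw message. The function name `parse_email_text` and the layout of the returned dictionary are hypothetical; they mirror the columns shown later but are not taken from `EnronEmailParser`.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr, parsedate_to_datetime


def parse_email_text(raw_text):
    """Hypothetical helper: parse one raw message into a flat dict."""
    msg = message_from_string(raw_text)

    # Collect To, Cc and Bcc into one list, ignoring the recipient type
    header_values = []
    for name in ('To', 'Cc', 'Bcc'):
        header_values.extend(msg.get_all(name, []))
    recipients = [addr.lower() for _, addr in getaddresses(header_values) if addr]

    # The Date header carries its own UTC offset, so the result is timezone-aware
    sent_at = parsedate_to_datetime(msg['Date'])

    # Assumes a plain-text, single-part message body
    body = msg.get_payload()

    return {
        'ts': int(sent_at.timestamp()),
        'datetime': sent_at,
        'sender': parseaddr(msg.get('From', ''))[1].lower(),
        'recipients': recipients,
        'num_recipients': len(recipients),
        'subject': msg.get('Subject', ''),
        'num_lines_in_msg': len(body.splitlines()),
    }
```

Keeping the recipients as a plain list at this stage leaves it to the dataset class to decide how to store them.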

## Basic Setup

With the classes that handle loading and parsing defined, we can now load and parse the data. The two main tables (pandas dataframes) are shown below, limited to the first 5 rows of each.

In [1]:
import pandas as pd
from enrondatahandling import EnronEmailDataset

In [2]:
# Load and parse the enron email dataset
enronData = EnronEmailDataset('./data')

Surveyed 1702 email files
Parsed 1702 emails


In [3]:
# Let's take a look at the emails table
enronData.emails.head()

| email_id | ts | datetime | sender | num_recipients | subject | num_lines_in_msg |
| -------- | -- | -------- | ------ | -------------- | ------- | ---------------- |
| ./data/4/54650.txt | 993726297 | 2001-06-28 04:04:57-07:00 | j.kaminski@enron.com | 1 | RE: Thu evening | 78 |
| ./data/6/173776.txt | 963928140 | 2000-07-18 06:49:00-07:00 | steven.kean@enron.com | 1 | Re: Price Cap Media--DRAFT | 81 |
| ./data/1/138102.txt | 1005755746 | 2001-11-14 08:35:46-08:00 | john.shelk@enron.com | 2 | RE: Dynegy/Enron Point of Contact | 51 |
| ./data/1/173413.txt | 951069180 | 2000-02-20 09:53:00-08:00 | steven.kean@enron.com | 3 | Re: Trade Mission | 315 |
| ./data/1/219048.txt | 997483225 | 2001-08-10 15:40:25-07:00 | ray.alvarez@enron.com | 4 | CONFIDENTIAL Attached file | 15 |


In [4]:
# The recipients table is kept separate so that no dataframe cell has to hold a list
enronData.recipients.head(5)

|   | email_id | recipient |
| - | -------- | --------- |
| 0 | ./data/1/10425.txt | kenneth.lay@enron.com |
| 1 | ./data/1/10425.txt | mark.frevert@enron.com |
| 2 | ./data/1/10425.txt | jeff.skilling@enron.com |
| 3 | ./data/1/10425.txt | mark.schroeder@enron.com |
| 4 | ./data/1/10425.txt | joseph.sutton@enron.com |
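To illustrate the design choice behind the separate recipients table: a long-format table with one row per (email, recipient) pair can be produced from a list-valued column with `DataFrame.explode`. The toy data and the names `raw`, `emails` and `recipients` below are made up for illustration, and this is not necessarily how `EnronEmailDataset` builds its tables.

```python
import pandas as pd

# Toy input: one row per email, recipients held as a Python list
raw = pd.DataFrame(
    {
        'email_id': ['a.txt', 'b.txt'],
        'sender': ['alice@example.com', 'bob@example.com'],
        'recipients': [
            ['bob@example.com'],
            ['alice@example.com', 'carol@example.com'],
        ],
    }
)

# Long format: one row per (email_id, recipient) pair
recipients = (
    raw[['email_id', 'recipients']]
    .explode('recipients')
    .rename(columns={'recipients': 'recipient'})
    .reset_index(drop=True)
)

# The emails table keeps only scalar columns and is indexed by email_id
emails = raw.drop(columns='recipients').set_index('email_id')
emails['num_recipients'] = recipients.groupby('email_id').size()
```

With this layout, per-recipient questions become ordinary joins and group-bys instead of row-wise list handling.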


## Basic analysis

Let's now run some basic analysis to see what these two tables can tell us.


### Question 1

The next two cells answer the following question:

**Let's label an email as "direct" if there is exactly one recipient and "broadcast" if it has multiple recipients. Identify the top 3 people who received the largest number of direct emails and the person (or people) who sent the largest number of broadcast emails.**

In [5]:
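# Join each recipient row to its email, keeping only single-recipient (direct) emails
direct_emails = enronData.emails[enronData.emails['num_recipients'] == 1]
directs = pd.merge(enronData.recipients, direct_emails, left_on='email_id', right_index=True)[['ts', 'recipient']]
# Count direct emails per recipient, largest first
directs = directs.groupby('recipient').count().sort_values(by='ts', ascending=False)
directs.columns = ['direct_email_count']
directs.head()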

| recipient | direct_email_count |
| --------- | ------------------ |
| maureen.mcvicker@enron.com | 115 |
| vkaminski@aol.com | 43 |
| jeff.dasovich@enron.com | 25 |
| richard.shapiro@enron.com | 23 |
| elizabeth.linnell@enron.com | 18 |


In [6]:
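# Keep only emails with more than one recipient (broadcast emails)
broadcasts = enronData.emails[enronData.emails['num_recipients'] > 1][['sender', 'ts']]
# Count broadcast emails per sender, largest first
broadcasts = broadcasts.groupby('sender').count().sort_values(by='ts', ascending=False)
broadcasts.columns = ['broadcast_email_count']
broadcasts.head()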

| sender | broadcast_email_count |
| ------ | --------------------- |
| steven.kean@enron.com | 252 |
| john.shelk@enron.com | 83 |
| j.kaminski@enron.com | 40 |
| miyung.buster@enron.com | 31 |
| alan.comnes@enron.com | 19 |


### Answer 1

Based on the outputs above, we can say:

- The top three people who received the largest number of direct emails are:
    1. Maureen McVicker (maureen.mcvicker@enron.com), with 115
    2. V Kaminski (vkaminski@aol.com), with 43
    3. Jeff Dasovich (jeff.dasovich@enron.com), with 25
- The person who sent the largest number of broadcast emails is **Steven Kean** (steven.kean@enron.com), with 252

---------------

### Question 2

This section answers the following question:

**Find the five emails with the fastest response times. Please include file IDs, subject, sender, recipient, and response times. (A response is defined as a message from one of the recipients to the original sender whose subject line contains all of the words from the subject of the original email, and the response time should be measured as the difference between when the original email was sent and when the response was sent.)**
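The definition above asks for word-level containment, which is not quite the same as a plain substring test: word order and letter case should not matter. A minimal sketch of such a check follows. The helper names `subject_words` and `contains_all_words`, and the choice to lowercase and split on non-alphanumeric characters, are assumptions made for illustration only.

```python
import re


def subject_words(subject):
    """Return the set of lowercase alphanumeric words in a subject line."""
    return set(re.findall(r'[a-z0-9]+', subject.lower()))


def contains_all_words(original_subject, response_subject):
    """True if every word of the original subject appears in the response subject."""
    original = subject_words(original_subject)
    # A subject with no words would trivially match everything, so reject it
    return bool(original) and original <= subject_words(response_subject)
```

A check like this could be applied row by row in place of a substring comparison, at the cost of also matching subjects that reuse the same words in a different order.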

In [7]:
# Nested joins to find all emails to which an email can be a potential response
responses = pd.merge(
    pd.merge(
        enronData.emails, 
        enronData.recipients, 
        left_on='sender',
        right_on='recipient'), 
    enronData.emails, 
    left_on='email_id', 
    right_index=True)

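# Drop unnecessary columns and rename the rest: the "_x" columns come from the
# candidate response, the "_y" columns from the original email
# (.copy() avoids pandas chained-assignment warnings when columns are added below)
responses = responses[['ts_x', 'subject_x', 'email_id', 'recipient', 'ts_y', 'sender_y', 'subject_y']].copy()
responses.columns = ['response_ts', 'response_subject', 'email_id', 'recipient', 'ts', 'sender', 'subject']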

# Apply conditions: skip originals whose subject is empty or a bare prefix,
# keep only candidates sent after the original, and require the original
# subject to appear (as a case-sensitive substring) in the response subject
responses['lowercase_subject'] = responses['subject'].str.lower()
responses['response_time'] = responses['response_ts'] - responses['ts']
responses = responses[(responses.lowercase_subject != '')
                      & (responses.lowercase_subject != 're:')
                      & (responses.lowercase_subject != 'fwd:')
                      & (responses.response_time > 0)]
responses = responses[responses.apply(lambda row: row['subject'] in row['response_subject'], axis=1)]

# Pick the shortest response time for each email, then rank emails by it
responses = responses.sort_values(by=['email_id', 'response_time'], ascending=True)
responses = responses.groupby(['email_id']).first().sort_values(by='response_time', ascending=True)

# Retain useful columns and rename some
responses = responses[['subject', 'sender', 'recipient', 'response_time']]
responses.columns = ['subject', 'sender', 'recipient_responder', 'response_time_in_seconds']
responses.head()

| email_id | subject | sender | recipient_responder | response_time_in_seconds |
| -------- | ------- | ------ | ------------------- | ------------------------ |
| ./data/1/139495.txt | FW: Confidential - GSS Organization Value to ETS | rod.hayslett@enron.com | stanley.horton@enron.com | 148 |
| ./data/1/228996.txt | RE: CONFIDENTIAL Personnel issue | michelle.cash@enron.com | lizzette.palmer@enron.com | 236 |
| ./data/1/121747.txt | Re: CONFIDENTIAL - Residential in CA | karen.denne@enron.com | jeff.dasovich@enron.com | 240 |
| ./data/4/122923.txt | RE: Eeegads... | paul.kaufman@enron.com | jeff.dasovich@enron.com | 240 |
| ./data/1/201878.txt | FW: SRP SETTLEMENT PROPOSAL - PRIVILEGED AND C... | m..tholt@enron.com | stephanie.miller@enron.com | 262 |


### Answer 2

Based on the outputs above, we can say that the five emails with the fastest response times are:

| rank | email_id | subject | sender | recipient_responder | response_time_in_seconds |
| ---- | -------- | ------- | ------ | ------------------- | ------------------------ |
| 1 | ./data/1/139495.txt | FW: Confidential - GSS Organization Value to ETS | rod.hayslett@enron.com | stanley.horton@enron.com | 148 |
| 2 | ./data/1/228996.txt | RE: CONFIDENTIAL Personnel issue | michelle.cash@enron.com | lizzette.palmer@enron.com | 236 |
| 3 | ./data/1/121747.txt | Re: CONFIDENTIAL - Residential in CA | karen.denne@enron.com | jeff.dasovich@enron.com | 240 |
| 4 | ./data/4/122923.txt | RE: Eeegads... | paul.kaufman@enron.com | jeff.dasovich@enron.com | 240 |
| 5 | ./data/1/201878.txt | FW: SRP SETTLEMENT PROPOSAL - PRIVILEGED AND C... | m..tholt@enron.com | stephanie.miller@enron.com | 262 |