## Overview
This notebook is for initial exploratory analysis of the Enron email dataset.
We'll look at the volume of mails grouped by user and perform some basic text analysis of the contents to determine whether a distinction can be made between suspicious and general work mails.
Note that the dataset contains the full mail folders of the selected users, including both inboxes and sent_items. To avoid counting the same mail twice, we'll only look at the sent_items for each user*. Restricting to sent mail also makes edge direction unambiguous when we build the network as a directed graph.

* This will require a bit of work: the folder names aren't standardised, so there may not be a sent_mail folder for each user. In addition, two users have no folder whose name contains the string 'sent'.
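As a rough sketch of how the sent-type folders could be located, the helpers below list, for each user, every subfolder whose name contains 'sent' (case-insensitive). The function names `find_sent_folders` and `sent_folders_by_user` are hypothetical, and the code assumes a layout of one folder per user directly under the root, which may not match the dataset exactly.

```python
import os

def find_sent_folders(user_dir):
    """Return the subfolder names of user_dir that contain 'sent' (case-insensitive)."""
    return sorted(
        name for name in os.listdir(user_dir)
        if os.path.isdir(os.path.join(user_dir, name)) and "sent" in name.lower()
    )

def sent_folders_by_user(rootdir):
    """Map each user folder under rootdir to its list of sent-type folders."""
    result = {}
    for user in sorted(os.listdir(rootdir)):
        user_dir = os.path.join(rootdir, user)
        if os.path.isdir(user_dir):
            result[user] = find_sent_folders(user_dir)
    return result
```

Users that map to an empty list are the ones that would need to be handled by hand.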


In [1]:
# Library imports
import pandas as pd
import networkx as nx
import os
import numpy as np
from email.parser import Parser

The first step is to load the mail contents into a dataframe using Eanna's mail analyser function, analyse_email.

In [4]:
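def analyse_email(inputfile, df):
    # Parse one raw mail file and return a dataframe with one (From, To) row per recipient.
    # The df argument is not used; it is kept so the existing calls still work.
    with open(inputfile, "r") as f:
        data = f.read()

    email = Parser().parsestr(data)

    if email['to']:
        # Remove the whitespace left by folded header lines, then split on commas
        email_to = email['to']
        email_to = email_to.replace("\n", "")
        email_to = email_to.replace("\t", "")
        email_to = email_to.replace(" ", "")

        email_to = email_to.split(",")

        to_length = len(email_to)

        # Repeat the sender once per recipient so both columns have the same length
        from_col = [email['from']] * to_length
    else:
        # No To header: keep a single row with an empty recipient (filtered out later)
        from_col = [email['from']]
        email_to = [""]

    email_df = pd.DataFrame(np.column_stack([from_col, email_to]),
                            columns = ['From', 'To'])
    return email_df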
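
As an alternative sketch (not the function used in this notebook), the standard library's `email.utils.getaddresses` can split the To header into bare addresses, which copes with folded lines and display names such as `Carol Jones <carol@example.com>`. The helper name `sender_recipient_pairs` and the sample message are made up for illustration. Unlike analyse_email, this version returns no row at all when the To header is missing.

```python
from email.parser import Parser
from email.utils import getaddresses

def sender_recipient_pairs(raw_message):
    """Return a list of (from_address, to_address) tuples for one raw message string."""
    msg = Parser().parsestr(raw_message, headersonly=True)
    sender = msg.get("from", "")
    # getaddresses yields (display_name, address) pairs; keep non-empty addresses only
    recipients = [addr for _name, addr in getaddresses(msg.get_all("to", [])) if addr]
    return [(sender, addr) for addr in recipients]

# Made-up message with a folded To header and a display name
sample = (
    "From: alice@example.com\n"
    "To: bob@example.com,\n"
    "\tCarol Jones <carol@example.com>\n"
    "Subject: test\n"
    "\n"
    "Body text\n"
)
pairs = sender_recipient_pairs(sample)
```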

In [5]:
rootdir = "AAIC_Fraud_Hackathon\\maildir\\"

df = pd.DataFrame({
        'From' : [],
        'To' : []})

all_frames = []
for directory, subdirectory, filenames in os.walk(rootdir):
    frames = [analyse_email(os.path.join(directory, filename), df) for filename in filenames]
    all_frames.extend(frames)

df = pd.concat(all_frames)

df = df[df.To != ""]

df.to_csv("Output\\emails_all.csv",
          index = False)

### First simple graph
As a starting point we'll build an initial graph using this dataset. First, though, we'll group the existing dataset by From and To, i.e. count the mails sent from one person to another. This count will then be used as the edge weight.
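
To make the grouping concrete, here is a minimal sketch on a made-up four-row dataset (the addresses and the `toy` and `counts` names are illustrative only). Each distinct (From, To) pair becomes one row, and direction matters: a mail from a to b is counted separately from a mail from b to a.

```python
import pandas as pd

# Made-up data: one row per (sender, recipient) of a mail
toy = pd.DataFrame({
    "From": ["a@x.com", "a@x.com", "a@x.com", "b@x.com"],
    "To":   ["b@x.com", "b@x.com", "c@x.com", "a@x.com"],
})

# Count the rows in each (From, To) group; the count will serve as the edge weight
counts = toy.groupby(["From", "To"]).size().reset_index(name="Count")
```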

In [2]:
# Aggregate the dataset together - equivalent to a count(*) group by from and to
df=pd.read_csv("Output\\emails_all.csv")

distinct_mails = pd.DataFrame(df.groupby(['From', 'To']).size(), columns=['Count'])

distinct_mails.head()

| From | To | Count |
|---|---|---|
| 'todd'.delahoussaye@enron.com | 'todd'.delahoussaye@enron.com | 5 |
| 'todd'.delahoussaye@enron.com | ajay.sharma@enron.com | 5 |
| 'todd'.delahoussaye@enron.com | anne.bike@enron.com | 1 |
| 'todd'.delahoussaye@enron.com | bianca.ornelas@enron.com | 5 |
| 'todd'.delahoussaye@enron.com | brant.reves@enron.com | 5 |


In [4]:
distinct_mails.shape

(288695, 1)

By grouping records with the same From and To, we've reduced the number of rows from almost 3 million to under 300,000. This is the number of edges in our graph. Let's look at the distinct count of senders and receivers to determine the number of vertices.

In [20]:
# Store the edges in a separate list
edgeList = distinct_mails.index[:]

edgeList[0]

("'todd'.delahoussaye@enron.com", "'todd'.delahoussaye@enron.com")

In [17]:
# Flatten the list to combine all senders and recipients into a single list
# Note that this step is slow...

users = list(sum(edgeList, ()))
len(users)

577390

In [18]:
# Translating that to a set will give a unique list of users in our dataset
users = list(set(users))
len(users)

71971
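
The flatten step is slow because `sum(edgeList, ())` builds a new, longer tuple at every addition, so the work grows roughly quadratically with the number of edges. A possible alternative, sketched here on a made-up edge list, is `itertools.chain.from_iterable`, which visits each tuple once. The `edge_list`, `endpoints` and `unique_users` names are illustrative and not part of the notebook.

```python
from itertools import chain

# Made-up (sender, recipient) pairs standing in for edgeList
edge_list = [("a@x.com", "b@x.com"), ("a@x.com", "c@x.com"), ("b@x.com", "a@x.com")]

# Flatten in a single pass: two endpoints per edge
endpoints = list(chain.from_iterable(edge_list))

# De-duplicate to get the vertex list
unique_users = sorted(set(endpoints))
```

The flattened list always has twice as many entries as there are edges, which matches the 577390 entries seen above for 288695 edges.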

So our graph will have 71,971 vertices and 288,695 edges. Let's build a first graph.

In [22]:
# We'll extract the contents of distinct_mails into a list of tuples
weights = list(distinct_mails['Count'])
edges = zip(edgeList, weights)
edges = [(w[0][0], w[0][1], w[1]) for w in edges]

edges[0]

("'todd'.delahoussaye@enron.com", "'todd'.delahoussaye@enron.com", 5)

In [23]:
G = nx.DiGraph()

G.add_weighted_edges_from(edges)

In [24]:
G.number_of_nodes()

71971

In [25]:
G.number_of_edges()

288695
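
As a small illustration of what the weighted directed graph holds, the sketch below builds a toy graph from made-up edges (the `H` and `toy_edges` names are illustrative) and reads back an edge weight, the self-loop count, and the two directions of contact for one user. Self-loops are worth checking because the first edge in our data runs from a user to themselves.

```python
import networkx as nx

# Made-up (sender, recipient, count) triples
toy_edges = [
    ("a@x.com", "a@x.com", 5),  # self-loop: sender and recipient are the same address
    ("a@x.com", "b@x.com", 2),
    ("b@x.com", "a@x.com", 1),
    ("c@x.com", "a@x.com", 4),
]

H = nx.DiGraph()
H.add_weighted_edges_from(toy_edges)

# The count is stored under the 'weight' attribute of each edge
weight_ab = H["a@x.com"]["b@x.com"]["weight"]

# Count self-loops by hand so the code does not depend on the networkx version
self_loops = sum(1 for u, v in H.edges() if u == v)

# Successors are the people a user sent mail to; predecessors are the people who mailed them
sent_to = sorted(H.successors("a@x.com"))
received_from = sorted(H.predecessors("a@x.com"))
```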

In [26]:
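# Let's take a look at a particular user.
# In a DiGraph, neighbors() returns successors, i.e. the people this user sent mail to.
# list() keeps the output a list on networkx versions where neighbors() returns an iterator.

list(G.neighbors("'todd'.delahoussaye@enron.com"))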

['brant.reves@enron.com',
 'veronica.espinoza@enron.com',
 'm..love@enron.com',
 'wendi.lebrocq@enron.com',
 'stewart.range@enron.com',
 'sandy.olitsky@enron.com',
 'luchas.johnson@enron.com',
 'randy.bhatia@enron.com',
 'reporting.exception@enron.com',
 'c..gossett@enron.com',
 'rudwell.johnson@enron.com',
 'leslie.reeves@enron.com',
 'anne.bike@enron.com',
 'tanya.rohauer@enron.com',
 'jeff.royed@enron.com',
 'richard.deming@enron.com',
 's..theriot@enron.com',
 'patrick.mulvany@enron.com',
 'jackson.logan@enron.com',
 'michelle.nelson@enron.com',
 'nick.moshou@enron.com',
 'shifali.sharma@enron.com',
 'nidia.mendoza@enron.com',
 'susan.bailey@enron.com',
 'keynan.dutton@enron.com',
 'credit<.williams@enron.com>',
 'bianca.ornelas@enron.com',
 'm..scott@enron.com',
 'lisa.hesse@enron.com',
 'paul.radous@enron.com',
 'd..sorenson@enron.com',
 'errol.mclaughlin@enron.com',
 'lesli.campbell@enron.com',
 'derek.bailey@enron.com',
 'laura.vargas@enron.com',
 'ellen.wallumrod@enron.com',
 