# Enron analysis using grouped edges

All emails from user a to user b are collapsed into a single edge e(a, b), with the weight of the edge given by the number of mails. This lets us measure things like the proximity between two users by looking at the edge weights.

In [1]:
# Library imports
import pandas as pd
import networkx as nx
import numpy as np
from NetworkGraph import NetworkGraph
import matplotlib.pyplot as plt
import operator

%matplotlib inline

In [2]:
df=pd.read_csv("Output\\emails_all_bck.csv")

df.shape

(2811799, 2)

In [3]:
df2 = df.groupby(['From','To']).size().reset_index()

df2.columns = ['From', 'To', 'Weight']

df2.head()

Unnamed: 0,From,To,Weight
0,'todd'.delahoussaye@enron.com,'todd'.delahoussaye@enron.com,5
1,'todd'.delahoussaye@enron.com,ajay.sharma@enron.com,5
2,'todd'.delahoussaye@enron.com,anne.bike@enron.com,1
3,'todd'.delahoussaye@enron.com,bianca.ornelas@enron.com,5
4,'todd'.delahoussaye@enron.com,brant.reves@enron.com,5


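The NetworkGraph class takes a list of tuples containing the edges, so we need to convert the grouped data frame before generating the graph object. During the conversion the mail count is inverted (1 / count), so that users who exchange many mails are joined by a short edge.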

In [4]:
# Store the smaller dataset for later use
# df2.to_csv('Output/grouped_mails.csv')

In [5]:
# Each edge is (sender, recipient, 1 / mail count): frequent correspondents get short edges
edges = [(x[0], x[1], 1.0/x[2]) for x in df2.values]
len(edges)

288695
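The edge weights passed to NetworkGraph are reciprocals of the mail counts. A minimal sketch of the same collapse-and-invert step using only the standard library is shown here; the addresses and counts are made up for illustration and are not taken from the Enron data.

```python
from collections import Counter

# Hypothetical toy mail log: one (sender, recipient) pair per email.
mails = [
    ("a@x.com", "b@x.com"),
    ("a@x.com", "b@x.com"),
    ("a@x.com", "b@x.com"),
    ("a@x.com", "b@x.com"),
    ("b@x.com", "c@x.com"),
]

# Collapse repeated pairs into one entry carrying the mail count.
counts = Counter(mails)

# Invert the count so that heavier correspondence gives a shorter edge.
edges = [(src, dst, 1.0 / n) for (src, dst), n in sorted(counts.items())]

for edge in edges:
    print(edge)
```

With this weighting, four mails give an edge of length 0.25 while a single mail gives length 1.0, so a shortest-path search favours well-used links.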

In [6]:
G = NetworkGraph(edges)

G.graph.number_of_nodes()

71971

In [7]:
G.graph.number_of_edges()

288695

In [8]:
nx.number_connected_components(G.graph.to_undirected())

1124

This is a surprisingly high number of connected components, most likely because of the withheld data. Some of these gaps should be filled in by the remaining data, so the number of components should decrease. Regardless, the focus should be on individual components, in particular those containing features or individuals of interest.
Let's first see how many nodes and edges each component has, and then identify cycles and cliques within them.

In [9]:
subgraphs = nx.weakly_connected_component_subgraphs(G.graph, copy=True)

In [10]:
components = []

for g in subgraphs:
    params = (g.number_of_edges(), g.number_of_nodes())
    components.append(params)
    
components.sort(reverse=True)
components[:10]

[(287014, 69476),
 (100, 101),
 (42, 43),
 (28, 29),
 (22, 23),
 (21, 14),
 (14, 15),
 (13, 14),
 (12, 13),
 (10, 11)]
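As far as I'm aware, `nx.weakly_connected_component_subgraphs` is no longer available in recent networkx releases (2.4 onwards). The sketch below shows one way to produce the same (edges, nodes) tally with `nx.weakly_connected_components` and `subgraph`; the toy graph is hypothetical and stands in for `G.graph`.

```python
import networkx as nx

# Toy directed graph with two weakly connected components.
G = nx.DiGraph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("x", "y")])

# weakly_connected_components yields one set of nodes per component;
# G.subgraph(nodes) gives the subgraph induced by that set.
components = []
for nodes in nx.weakly_connected_components(G):
    sub = G.subgraph(nodes)
    components.append((sub.number_of_edges(), sub.number_of_nodes()))

# Largest components first, as in the cell above.
components.sort(reverse=True)
print(components)
```

`subgraph` returns a view rather than a copy, so call `.copy()` on it if the component needs to be modified independently of the full graph.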

It looks like most of the graph is contained in a single large component, with the remaining pieces being small components of just a few nodes. Let's try to find cycles in the smaller ones.

In [11]:
subgraphs = nx.weakly_connected_component_subgraphs(G.graph, copy=True)

for g in subgraphs:
    if g.number_of_edges() < 2000:
        net = NetworkGraph(g)
        net.findCycles()
        if len(net.cycles) > 0:
            net.printCycles()
    else:
        # Store the maximal component for later
        G2 = NetworkGraph(g)

merlyn@stonehenge.com->cp@onsitetech.com->tex@off.org->merlyn@stonehenge.com
merlyn@stonehenge.com->cp@onsitetech.com->merlyn@stonehenge.com
merlyn@stonehenge.com->cp@onsitetech.com->tex@off.org->merlyn@stonehenge.com
merlyn@stonehenge.com->cp@onsitetech.com->merlyn@stonehenge.com
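The two distinct cycles printed above can be cross-checked with networkx's own `nx.simple_cycles`. This is only a sketch: the four edges are inferred from the printed cycles, and it does not reproduce how `findCycles` works internally, which the notebook does not show.

```python
import networkx as nx

# Edges inferred from the cycles printed above.
G = nx.DiGraph()
G.add_edges_from([
    ("merlyn@stonehenge.com", "cp@onsitetech.com"),
    ("cp@onsitetech.com", "tex@off.org"),
    ("tex@off.org", "merlyn@stonehenge.com"),
    ("cp@onsitetech.com", "merlyn@stonehenge.com"),
])

# simple_cycles yields each elementary cycle once, as a list of nodes.
cycles = [list(cycle) for cycle in nx.simple_cycles(G)]

for cycle in cycles:
    # Repeat the first node at the end to mirror the arrow format used above.
    print("->".join(cycle + [cycle[0]]))
```

Each cycle is reported once here, although the starting node of a printed cycle may differ from the output above.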


There doesn't appear to be anything of interest in the small components, so we can probably discard them for the rest of the analysis.

As it stands, this graph is too big to analyse satisfactorily. However, based on the problem statement we know of 5 users who have already been convicted: Chief Executive Officer Jeff Skilling, CEO and chairman Ken Lay, Chief Financial Officer Andrew Fastow, Chief Accounting Officer Rick Causey and Corporate Treasurer Ben Glisan. We take these known users as the starting point and only look at a reduced subset of users in close proximity to them. Proximity is computed as the weighted path distance within the graph.

We'll compute the distance between each of these 5 and all other users, then take the 500 closest users to each of them and drill down to the subgraph formed by these users.

The list of nodes that we want to focus on is therefore:
- 'jeff.skilling@enron.com'
- 'kenneth.lay@enron.com'
- 'andrew.fastow@enron.com'
- 'ben.glisan@enron.com'
- 'richard.causey@enron.com'
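The "k closest users by path distance" step can be sketched on a toy graph as follows. The node names, mail counts and the helper `closest_nodes` are hypothetical; only `nx.single_source_dijkstra_path_length` is a networkx call.

```python
import networkx as nx

# Toy graph; weights are reciprocals of hypothetical mail counts.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("seed", "a", 1.0 / 10),
    ("seed", "b", 1.0 / 2),
    ("a", "c", 1.0 / 5),
    ("b", "d", 1.0 / 1),
])


def closest_nodes(graph, source, k):
    """Return the k nodes with the smallest weighted distance from source."""
    dist = nx.single_source_dijkstra_path_length(graph, source, weight="weight")
    ranked = sorted(dist.items(), key=lambda item: item[1])
    return [node for node, _ in ranked[:k]]


print(closest_nodes(G, "seed", 3))
```

The source node is always at distance 0, so it takes up one of the k slots. Nodes that cannot be reached from the source do not appear in the result at all.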


In [12]:
# Find those present in the graph

convicts = ['jeff.skilling@enron.com', 'kenneth.lay@enron.com',
           'andrew.fastow@enron.com', 'ben.glisan@enron.com',
           'richard.causey@enron.com']

present_convicts = [x for x in convicts if x in G.graph]
present_convicts

['jeff.skilling@enron.com',
 'kenneth.lay@enron.com',
 'andrew.fastow@enron.com',
 'ben.glisan@enron.com',
 'richard.causey@enron.com']

In [13]:
# Store proximities for each convict
distances = {}
close_nodes = []

for c in convicts:
    distances[c] = nx.shortest_path_length(G.graph, source=c, target=None, weight='weight')
    ordered_nodes = sorted(distances[c].items(), key=operator.itemgetter(1))
    close_nodes += [x[0] for x in ordered_nodes[:500]]
    
close_nodes = list(set(close_nodes))
len(close_nodes)

1193

In [14]:
ordered_nodes[:20]

[('richard.causey@enron.com', 0),
 ('sally.beck@enron.com', 0.1),
 ('lexi.elliott@enron.com', 0.1),
 ('patti.thompson@enron.com', 0.10246913580246914),
 ('leslie.reeves@enron.com', 0.10462962962962963),
 ('brent.price@enron.com', 0.10469483568075118),
 ('beth.apollo@enron.com', 0.1053475935828877),
 ('louise.kitchen@enron.com', 0.10591715976331362),
 ('mary.solmonson@enron.com', 0.10689655172413794),
 ('sheila.glover@enron.com', 0.10704225352112677),
 ('greg.piper@enron.com', 0.10709219858156029),
 ('mike.jordan@enron.com', 0.10724637681159421),
 ('cwhite@viviance.com', 0.1076923076923077),
 ('peggy.hedstrom@enron.com', 0.1078125),
 ('shona.wilson@enron.com', 0.1078740157480315),
 ('john.lavorato@enron.com', 0.10793650793650794),
 ('bob.hall@enron.com', 0.10806451612903226),
 ('brenda.herod@enron.com', 0.10819672131147541),
 ('david.delainey@enron.com', 0.10862068965517242),
 ('robert.superty@enron.com', 0.10877192982456141)]

As expected, there is significant overlap between the closest nodes to each convict: the 5 lists of 500 reduce to 1193 unique nodes. Let's attempt to find regions of interest within this subgraph (cliques, cycles and black holes / volcanoes).

In [15]:
g2 = G.graph.subgraph(close_nodes)

In [16]:
# Create a network graph using this subset
G2 = NetworkGraph(g2)
G2.graph.number_of_edges()

31877

In [17]:
# Save to csv
# nx.write_edgelist(G2.graph, "Output\\convict_subgraph.csv", delimiter=',')

Using the implemented algorithms, identify the cycles, cliques and black holes within the graph.

In [18]:
G2.findCycles()

len(G2.cycles)

0

In [19]:
#cycles = list(nx.simple_cycles(G2.graph))
#len(cycles)

In [20]:
#lengths = [len(c) for c in G2.cycles]
#lengths.sort(reverse=True)

#lengths[:20]

In [21]:
# G2.findMaximalCliques()
# len(G2.maximal_cliques)

# Maximum recursion depth error with this method, despite the small graph size

In [22]:
# Use the networkx clique finder instead

cliques = list(nx.find_cliques(G2.graph.to_undirected()))

len(cliques)

54539

In [23]:
list(cliques[0])

['susan.mara@enron.com',
 'matt.motley@enron.com',
 'alan.comnes@enron.com',
 'tim.belden@enron.com',
 'robert.badeer@enron.com',
 'sean.crandall@enron.com',
 'rcarroll@bracepatt.com',
 'sarah.novosel@enron.com',
 'mike.swerzbin@enron.com',
 'mary.hain@enron.com',
 'lysa.akin@enron.com']

If any of the convicts were present in a clique, then the other members of that clique would be under suspicion. Let's check whether we have any such cliques.

In [24]:
suspicious_cliques = []

for c in convicts:
    suspicious_cliques += [clique for clique in cliques if c in list(clique)]
    
len(suspicious_cliques)

9057

To narrow down this number further, we consider cliques with more than 1 convict. We can group the cliques into 5 categories based on the number of convicts present within the clique.

In [29]:
# Dictionary to store clique groupings
convicts_per_clique = {}
for i in range(1, 6):
    convicts_per_clique[i] = []

# Transform convicts to a set object to look at intersections
convict_set = set(convicts)
for clique in suspicious_cliques:
    intersect = len(convict_set.intersection(set(list(clique))))
    convicts_per_clique[intersect].append(clique)
    
for i in range(1, 6):
    print i, len(convicts_per_clique[i])

1 6770
2 2104
3 183
4 0
5 0
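The grouping can also be done in a single pass over the maximal cliques, so that each clique is counted once regardless of how many seed nodes it contains. The sketch below uses a hypothetical toy graph and seed set; `nx.find_cliques` is the same networkx call used earlier.

```python
from collections import defaultdict

import networkx as nx

# Toy undirected graph: c1, c2, c3 play the role of the known seed users.
G = nx.Graph()
G.add_edges_from([
    ("c1", "c2"), ("c1", "u1"), ("c2", "u1"),  # triangle containing two seeds
    ("c3", "u2"),                              # pair containing one seed
    ("u3", "u4"),                              # pair containing no seed
])
seeds = {"c1", "c2", "c3"}

# Map "number of seeds in the clique" -> list of cliques.
by_seed_count = defaultdict(list)
for clique in nx.find_cliques(G):
    hits = len(seeds.intersection(clique))
    if hits:
        by_seed_count[hits].append(sorted(clique))

for hits in sorted(by_seed_count):
    print(hits, by_seed_count[hits])
```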


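The highest number we get is 3 convicts within a single clique. Note that suspicious_cliques holds one copy of a clique for every convict it contains, so the 183 entries correspond to 61 distinct cliques. There will probably be overlap in the nodes within these cliques, so let's combine all 183 entries and look at the list of unique nodes that comes out.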

In [31]:
three_convict_cliques = []
for c in convicts_per_clique[3]:
    three_convict_cliques += list(c)
    
three_convict_cliques = list(set(three_convict_cliques))
len(three_convict_cliques)

31

In [32]:
three_convict_cliques

['david.oxley@enron.com',
 'jeff.skilling@enron.com',
 'ben.glisan@enron.com',
 'rosalee.fleming@enron.com',
 'greg.whalley@enron.com',
 'john.sherriff@enron.com',
 'lexi.elliott@enron.com',
 'rebecca.carter@enron.com',
 'wes.colwell@enron.com',
 'sally.beck@enron.com',
 'j..kean@enron.com',
 'paula.rieker@enron.com',
 'linda.robertson@enron.com',
 'sherri.sera@enron.com',
 'david.delainey@enron.com',
 'john.lavorato@enron.com',
 'l..wells@enron.com',
 'katherine.brown@enron.com',
 'kenneth.lay@enron.com',
 'maureen.mcvicker@enron.com',
 'bryan.seyfried@enron.com',
 'sharron.westbrook@enron.com',
 'andrew.fastow@enron.com',
 'maureen.raymond@enron.com',
 'vanessa.groscrand@enron.com',
 'mike.mcconnell@enron.com',
 'karen.denne@enron.com',
 'gary.hickerson@enron.com',
 'rick.buy@enron.com',
 'louise.kitchen@enron.com',
 'richard.causey@enron.com']

In [25]:
G2.findBlackHoles(30)

len(G2.black_holes)

# Despite the pruning, the brute-force part of the algorithm is still too intensive
# Memory error on the step to calculate all possible subsets of the remaining nodes

5

In [26]:
for b in G2.black_holes:
    print b

['audrey.cook@enron.com', 'lisa.valderrama@enron.com']
['fatimata.liamidi@enron.com', 'vkamins@enron.com']
['jdasovic@enron.com', 'steven.j.kean@enron.com']
['jfrizzell@gibbs-bruns.com', 'richard.b.sanders@enron.com']
['liz@luntz.com', 'skean@enron.com']
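The notebook does not show how `findBlackHoles` or `findVolcanos` are implemented. As a rough illustration only, the sketch below assumes the common definitions: a black hole is a group of nodes with no edges leaving the group, and a volcano is a group with no edges entering it. The function names and the toy edge list are hypothetical, and the connectivity requirement that usually accompanies these definitions is not checked.

```python
def is_blackhole(edges, members):
    """True if no edge starts inside `members` and ends outside it."""
    members = set(members)
    return not any(src in members and dst not in members for src, dst in edges)


def is_volcano(edges, members):
    """True if no edge starts outside `members` and ends inside it."""
    members = set(members)
    return not any(src not in members and dst in members for src, dst in edges)


# Toy directed edge list: a and b only send, p and q only talk to each other.
edges = [("a", "p"), ("b", "q"), ("p", "q"), ("q", "p")]

print(is_blackhole(edges, {"p", "q"}))  # mail flows in but never out
print(is_volcano(edges, {"a", "b"}))    # mail flows out but never in
```

Checking a given candidate set is cheap; the expensive part of a brute-force search is enumerating the candidate subsets, which grows combinatorially with the size limit.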


In [27]:
G2.findVolcanos(30)

len(G2.volcanoes)

0