# Multi-table Datasets - ENRON Archive

## 1. Data import

Connect to the file 'assets/datasets/enron.db' using one of these methods:

- sqlite3 python package
- pandas.read_sql
- SQLite Manager Firefox extension

Inspect the database by querying the `sqlite_master` table. How many tables does it contain?

> Answer:
There are 3 tables:
- MessageBase
- RecipientBase
- EmployeeBase

In [2]:
import sqlite3
import pandas as pd
# Local copy of assets/datasets/enron.db; adjust the path for your machine
enron_db = 'C:/Users/Pat.NOAGALLERY/Documents/data_sources/enron.db'
conn = sqlite3.connect(enron_db)

In [3]:
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE TYPE ='table'", con=conn)
tables


Unnamed: 0,name
0,MessageBase
1,RecipientBase
2,EmployeeBase


In [4]:
query = '''SELECT * FROM MessageBase LIMIT 10'''
pd.read_sql(query, con=conn)

Unnamed: 0,mid,filename,unix_time,subject,from_eid
0,1,taylor-m/sent/11,910930020,Cd$ CME letter,138
1,2,taylor-m/sent/17,911459940,Indemnification,138
2,3,taylor-m/sent/18,911463840,Re: Indemnification,138
3,4,taylor-m/sent/23,911874180,"Re: Coral Energy, L.P.",138
4,5,taylor-m/sent/27,912396120,Bankruptcy Code revisions,138
5,6,taylor-m/sent/31,912570420,Re: Position Description,138
6,7,taylor-m/sent/33,912576240,Koch,138
7,8,taylor-m/sent/40,912685080,Re: Time to Celebrate!,138
8,9,taylor-m/sent/41,912734100,Re: Vacation Request,138
9,10,taylor-m/sent/44,913166040,Re: Last Message,138


In [5]:
query = '''SELECT * FROM RecipientBase LIMIT 10'''
pd.read_sql(query, con=conn)

Unnamed: 0,mid,rno,to_eid
0,1,1,59
1,2,1,15
2,3,1,15
3,4,1,109
4,4,2,49
5,4,3,120
6,4,4,59
7,5,1,45
8,5,2,53
9,6,1,113


In [6]:
query = '''SELECT * FROM EmployeeBase LIMIT 10'''
pd.read_sql(query, con=conn)

Unnamed: 0,eid,name,department,longdepartment,title,gender,seniority
0,1,John Arnold,Forestry,ENA Gas Financial,VP Trading,Male,Senior
1,2,Harry Arora,Forestry,ENA East Power,VP Trading,Male,Senior
2,3,Robert Badeer,Forestry,ENA West Power,Mgr Trading,Male,Junior
3,4,Susan Bailey,Legal,ENA Legal,Specialist Legal,Female,Junior
4,5,Eric Bass,Forestry,ENA Gas Texas,Trader,Male,Junior
5,6,Don Baughman Jr.,Forestry,ENA East Power,Mgr Trading,Male,Junior
6,7,Sally Beck,Other,Energy Operations,VP,Female,Senior
7,8,Robert Benson,Forestry,ENA East Power,Dir Trading,Male,Senior
8,9,Lynn Blair,Other,ETS,Director,Female,Senior
9,10,Sandra F. Brawner,Forestry,ENA Gas East,Dir Trading,Female,Senior


Query the `sqlite_master` table to retrieve the schema of the `EmployeeBase` table.

1. What fields are there?
1. What's the type of each of them?

In [7]:

for table in tables['name']:
    print(table)
    for row in conn.execute(f"PRAGMA table_info({table})").fetchall():
        print(row)

MessageBase
(0, 'mid', 'INTEGER', 0, None, 1)
(1, 'filename', 'TEXT', 0, None, 0)
(2, 'unix_time', 'INTEGER', 0, None, 0)
(3, 'subject', 'TEXT', 0, None, 0)
(4, 'from_eid', 'INTEGER', 0, None, 0)
RecipientBase
(0, 'mid', 'INTEGER', 0, None, 1)
(1, 'rno', 'INTEGER', 0, None, 2)
(2, 'to_eid', 'INTEGER', 0, None, 0)
EmployeeBase
(0, 'eid', 'INTEGER', 0, None, 0)
(1, 'name', 'TEXT', 0, None, 0)
(2, 'department', 'TEXT', 0, None, 0)
(3, 'longdepartment', 'TEXT', 0, None, 0)
(4, 'title', 'TEXT', 0, None, 0)
(5, 'gender', 'TEXT', 0, None, 0)
(6, 'seniority', 'TEXT', 0, None, 0)
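
The cell above reads the schema through `PRAGMA table_info`. As a minimal sketch of the `sqlite_master` route the exercise mentions, the snippet below reads the stored `CREATE TABLE` statement from the `sql` column. It runs against an in-memory stand-in database (`conn_demo` is a hypothetical name) whose column list mirrors the `EmployeeBase` output above, so it does not need `enron.db`.

```python
import sqlite3

# In-memory stand-in for enron.db; columns mirror the EmployeeBase schema printed above
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE EmployeeBase ("
    "eid INTEGER, name TEXT, department TEXT, longdepartment TEXT, "
    "title TEXT, gender TEXT, seniority TEXT)"
)

# sqlite_master keeps the original CREATE statement in its `sql` column
schema_sql = conn_demo.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
    ("EmployeeBase",),
).fetchone()[0]
print(schema_sql)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in conn_demo.execute("PRAGMA table_info(EmployeeBase)")]
print(columns)
```

Against the real database, the same two queries should work with `conn` in place of `conn_demo`.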


1. Print the first 5 rows of EmployeeBase table
1. Print the first 5 rows of MessageBase table
1. Print the first 5 rows of RecipientBase table

**Hint:** use `SELECT` and `LIMIT`.

In [8]:
query = '''SELECT * FROM EmployeeBase LIMIT 5'''
print("EmployeeBase")
print(pd.read_sql(query, con=conn))
query = '''SELECT * FROM MessageBase LIMIT 5'''
print("MessageBase")
print(pd.read_sql(query, con=conn))
query = '''SELECT * FROM RecipientBase LIMIT 5'''
print("RecipientBase")
print(pd.read_sql(query, con=conn))


EmployeeBase
   eid           name department     longdepartment             title  gender  \
0    1    John Arnold   Forestry  ENA Gas Financial        VP Trading    Male   
1    2    Harry Arora   Forestry     ENA East Power        VP Trading    Male   
2    3  Robert Badeer   Forestry     ENA West Power       Mgr Trading    Male   
3    4   Susan Bailey      Legal          ENA Legal  Specialist Legal  Female   
4    5      Eric Bass   Forestry      ENA Gas Texas            Trader    Male   

  seniority  
0    Senior  
1    Senior  
2    Junior  
3    Junior  
4    Junior  
MessageBase
   mid          filename  unix_time                    subject  from_eid
0    1  taylor-m/sent/11  910930020             Cd$ CME letter       138
1    2  taylor-m/sent/17  911459940            Indemnification       138
2    3  taylor-m/sent/18  911463840        Re: Indemnification       138
3    4  taylor-m/sent/23  911874180     Re: Coral Energy, L.P.       138
4    5  taylor-m/sent/27  912396120  Ba

Import each of the 3 tables into a pandas DataFrame.

In [9]:
query = '''SELECT * FROM EmployeeBase'''
employee = pd.read_sql(query, con=conn)
print("employee shape ", employee.shape)
query = '''SELECT * FROM MessageBase'''
message = pd.read_sql(query, con=conn)
print("message shape ", message.shape)
query = '''SELECT * FROM RecipientBase'''
recipient = pd.read_sql(query, con=conn)
print("recipient shape ", recipient.shape)

employee shape  (156, 7)
message shape  (21635, 5)
recipient shape  (38388, 3)


## 2. Data Exploration

Use the 3 dataframes to answer the following questions:

1. How many employees are there in the company?
1. How many messages are there in the database?
1. Convert the timestamp column in the messages. When was the oldest message sent? And the newest?
1. Some messages are sent to more than one recipient. Group the recipients by message id (`mid`) and count the number of recipients per message. Then look at the distribution of recipient counts.
    - How many messages have only one recipient?
    - How many messages have >= 5 recipients?
    - What's the highest number of recipients?
    - Who sent the message with the highest number of recipients?
1. Plot the distribution of recipient counts using Bokeh.
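
The `unix_time` column stores seconds since 1970-01-01 00:00:00 UTC. As a small illustration of the conversion, here is a standard-library sketch using the first three `unix_time` values from the `MessageBase` preview shown earlier. The variable names are illustrative only.

```python
from datetime import datetime, timezone

# unix_time values copied from the first rows of MessageBase shown earlier
unix_times = [910930020, 911459940, 911463840]

# Interpret each value as seconds since the Unix epoch, in UTC
sent_at = [datetime.fromtimestamp(ts, tz=timezone.utc) for ts in unix_times]

oldest = min(sent_at)
newest = max(sent_at)
print("Oldest:", oldest.strftime("%Y-%m-%d %H:%M:%S"))
print("Newest:", newest.strftime("%Y-%m-%d %H:%M:%S"))
```

The first value converts to 1998-11-13 04:07:00, which matches the oldest message reported for the full table. In pandas, `pd.to_datetime(series, unit='s')` performs the same conversion on a whole column.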

In [10]:
print("Total Employees = ", employee.shape[0])
print("Total Messages = ", message.shape[0])

# Convert Unix timestamps (seconds since the epoch) to datetimes
message['time'] = pd.to_datetime(message['unix_time'], unit='s')
print("Oldest Message = ", message['time'].min())
print("Newest Message = ", message['time'].max())

# RecipientBase has one row per (message, recipient), so the size of each
# `mid` group is the number of recipients of that message
msg = recipient.groupby('mid').size()

print("There were ", (msg == 1).sum(), " messages with only 1 recipient")
print("There were ", (msg >= 5).sum(), " messages with >= 5 recipients")
print("The highest number of recipients for a message was ", msg.max())

# Look up the sender of the message with the most recipients
top_mid = msg.idxmax()
top_sender = message.loc[message['mid'] == top_mid, 'from_eid'].iloc[0]
print("The message with the most recipients was sent by the user with eid ", top_sender)


Total Employees =  156
Total Messages =  21635
Oldest Message =  1998-11-13 04:07:00
Newest Message =  2002-06-21 13:37:34
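
To make the recipients-per-message logic concrete, here is a minimal sketch on the ten `RecipientBase` rows and six `MessageBase` rows shown in the previews earlier. The `*_demo` names are hypothetical stand-ins for the full DataFrames.

```python
import pandas as pd

# First ten RecipientBase rows from the preview above (mid, rno, to_eid)
recipient_demo = pd.DataFrame({
    "mid":    [1, 2, 3, 4, 4, 4, 4, 5, 5, 6],
    "rno":    [1, 1, 1, 1, 2, 3, 4, 1, 2, 1],
    "to_eid": [59, 15, 15, 109, 49, 120, 59, 45, 53, 113],
})
# Matching MessageBase rows; all six were sent by eid 138
message_demo = pd.DataFrame({"mid": [1, 2, 3, 4, 5, 6], "from_eid": [138] * 6})

# One row per (message, recipient), so the group size is the recipient count
n_recipients = recipient_demo.groupby("mid").size()

single = int((n_recipients == 1).sum())   # messages with exactly one recipient
largest_mid = n_recipients.idxmax()       # message with the most recipients
sender = message_demo.set_index("mid").loc[largest_mid, "from_eid"]

# Distribution: how many messages have 1, 2, 3, ... recipients
distribution = n_recipients.value_counts().sort_index()
print(single, largest_mid, sender)
print(distribution)
```

On this small sample, four messages have a single recipient and message 4 has the most recipients (four). The same calls should scale to the full `recipient` and `message` DataFrames.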


The Bokeh histogram below shows this distribution:

In [11]:
# bokeh.charts is Bokeh's legacy high-level charts API
# (later split out as the separate bkcharts package)
from bokeh.charts import Histogram, defaults, show, output_file

defaults.width = 450
defaults.height = 350

# Histogram of the per-message recipient counts
hist = Histogram(msg, title="Distribution of Recipient Numbers")

hist.xaxis.axis_label = 'Recipients per message'
hist.yaxis.axis_label = '# of Msgs'

output_file("histograms.html")

show(hist)



Rescale to investigate the tail of the curve

No extra code is needed here: Bokeh plots are interactive, so you can zoom and pan to examine the tail of the distribution.
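
If you prefer to inspect the tail numerically rather than by zooming, one possible approach is sketched below. The recipient counts are made up for illustration, and `threshold` is an arbitrary cut-off rather than something the exercise prescribes.

```python
import pandas as pd

# Hypothetical recipient counts per message (illustrative, not taken from enron.db)
n_recipients = pd.Series([1, 1, 1, 1, 2, 2, 3, 5, 8, 40])

# Number of messages for each recipient count
distribution = n_recipients.value_counts().sort_index()

# Keep only the tail: messages with at least `threshold` recipients
threshold = 5
tail = distribution[distribution.index >= threshold]

# Share of all messages that fall in the tail
tail_share = tail.sum() / len(n_recipients)
print(tail)
print(f"{tail_share:.0%} of messages have >= {threshold} recipients")
```

With the real data, `n_recipients` would be the per-message recipient count computed from `RecipientBase`.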

## 3. Data Merging

Use the pandas merge function to combine the information in the 3 dataframes to answer the following questions:

1. Are there more Men or Women employees?
1. How is gender distributed across departments?
1. Who is sending more emails? Men or Women?
1. What's the average number of emails sent by each gender?
1. Are there more Juniors or Seniors?
1. Who is sending more emails? Juniors or Seniors?
1. Which department is sending more emails? How does that relate to the number of employees in the department?
1. Who are the top 3 senders of emails? (people who sent out the most emails)

For reference, the table schemas are:

MessageBase

- (0, 'mid', 'INTEGER', 0, None, 1)
- (1, 'filename', 'TEXT', 0, None, 0)
- (2, 'unix_time', 'INTEGER', 0, None, 0)
- (3, 'subject', 'TEXT', 0, None, 0)
- (4, 'from_eid', 'INTEGER', 0, None, 0)

RecipientBase

- (0, 'mid', 'INTEGER', 0, None, 1)
- (1, 'rno', 'INTEGER', 0, None, 2)
- (2, 'to_eid', 'INTEGER', 0, None, 0)

EmployeeBase

- (0, 'eid', 'INTEGER', 0, None, 0)
- (1, 'name', 'TEXT', 0, None, 0)
- (2, 'department', 'TEXT', 0, None, 0)
- (3, 'longdepartment', 'TEXT', 0, None, 0)
- (4, 'title', 'TEXT', 0, None, 0)
- (5, 'gender', 'TEXT', 0, None, 0)
- (6, 'seniority', 'TEXT', 0, None, 0)


In [142]:
# Are there more Men or Women employees?
gender = employee.groupby('gender').size()
print("\nThe gender split in the organization is as follows \n", gender)

# How is gender distributed across departments?
gender = employee.groupby(['department', 'gender']).size()
print("\nGender is split across departments as follows \n", gender)

#Who is sending more emails? Men or Women?
base = pd.merge(employee, message, left_on='eid', right_on='from_eid')
basegender = base.groupby('gender').size()
print("\nMessages sent split by gender \n", basegender)

# What's the average number of emails sent by each gender?
# Divide messages sent per gender by the number of employees of that gender
avggender = basegender / employee.groupby('gender').size()
print("\nAverage messages sent per employee, by gender \n", avggender)


The gender split in the organization is as follows 
 gender
Female     43
Male      113
dtype: int64

Gender is split across departments as follows 
 department  gender
Forestry    Female    10
            Male      50
Legal       Female    13
            Male      12
Other       Female    20
            Male      51
dtype: int64

Messages sent split by gender 
 gender
Female     8794
Male      12841
dtype: int64

Average messages sent per employee, by gender 
 gender
Female    204.511628
Male      113.637168
dtype: float64
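
The remaining sender questions (seniority, department, top senders) follow the same merge-then-group pattern. The sketch below uses tiny made-up tables, not the Enron data, and `employee_demo`, `message_demo`, and the employee names are hypothetical.

```python
import pandas as pd

# Tiny made-up stand-ins for the employee and message DataFrames
employee_demo = pd.DataFrame({
    "eid": [1, 2, 3, 4],
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "department": ["Legal", "Legal", "Other", "Other"],
    "seniority": ["Senior", "Junior", "Junior", "Senior"],
})
message_demo = pd.DataFrame({
    "mid": [10, 11, 12, 13, 14, 15, 16],
    "from_eid": [1, 1, 1, 1, 2, 2, 3],
})

# Attach sender attributes to every message; eid 4 sent nothing and drops out
sent = pd.merge(employee_demo, message_demo, left_on="eid", right_on="from_eid")

by_seniority = sent.groupby("seniority").size()

# Messages per department next to headcount, to relate volume to department size
by_department = pd.DataFrame({
    "messages": sent.groupby("department").size(),
    "employees": employee_demo.groupby("department").size(),
})
by_department["messages_per_employee"] = (
    by_department["messages"] / by_department["employees"]
)

top_senders = sent.groupby("name").size().sort_values(ascending=False).head(3)
print(by_seniority, by_department, top_senders, sep="\n\n")
```

Dividing by headcount from the employee table, rather than from the merged table, keeps employees who sent no messages in the denominator.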


Answer the following questions regarding received messages:

- Who is receiving more emails? Men or Women?
- Who is receiving more emails? Juniors or Seniors?
- Which department is receiving more emails? How does that relate with the number of employees in the department?
- Who are the top 5 receivers of emails? (people who received the most emails)
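
For received messages, the join key changes from `from_eid` to `to_eid`. A minimal sketch on made-up data (the `*_demo` tables and names are hypothetical) could look like this:

```python
import pandas as pd

# Made-up stand-ins for the employee and recipient DataFrames
employee_demo = pd.DataFrame({
    "eid": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
    "gender": ["Female", "Male", "Male"],
})
recipient_demo = pd.DataFrame({
    "mid":    [10, 10, 11, 12, 12, 12, 13],
    "rno":    [1, 2, 1, 1, 2, 3, 1],
    "to_eid": [1, 2, 1, 1, 2, 3, 1],
})

# Each RecipientBase row is one received email; join on the recipient's eid
received = pd.merge(recipient_demo, employee_demo, left_on="to_eid", right_on="eid")

received_by_gender = received.groupby("gender").size()
top_receivers = received.groupby("name").size().sort_values(ascending=False).head(5)
print(received_by_gender)
print(top_receivers)
```

Grouping by `seniority` or `department` instead of `gender` would cover the other questions in the list in the same way.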

Which employees sent the most 'mass' emails?

Keep exploring the dataset, which other questions would you ask?

Work in pairs. Give each other a challenge and try to solve it.