<a href="https://colab.research.google.com/github/michael-borck/isys2001-assignment/blob/main/enron_staff_answer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enron Email Analysis

This notebook is an initial investigation into the publicly available Enron e-mail dataset. This well-known dataset consists of e-mail messages sent to, from, and between employees of the now-defunct Enron Corporation. As part of the U.S. Government investigation into accounting fraud at Enron, the e-mails became part of the public record and can now be downloaded by anyone.

This report will focus on two areas:

* the volume of daily emails
* who were the greatest communicators

We will only consider emails sent between....


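First, we need to download the MySQL Enron corpus by following the instructions at http://www.ahschulz.de/enron-email-data/. Once imported, the data can be queried using either the MySQL command-line interface or a web-based tool such as phpMyAdmin. This notebook itself works from a SQLite copy of the database (`enron.db`), which the code cell further down downloads if it is not already present.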


# Report Format

This report is provided as an interactive notebook. It is designed to run in the Google Colab environment. You can either read the report as provided or inspect and rerun the code, allowing you to verify the analysis.

Right away, we notice that numerous e-mails have incorrect dates. Some dates predate or postdate the existence of the corporation (for example, 1979), and others fall in years that make no sense (for example, 0001 or 2044). E-mail is old, but not that old!
The following table shows an excerpt of a few of the odd rows (the complete result set is about 1,300 rows long). All of these dates are formatted correctly; however, some of them are definitely wrong. The main query in this notebook therefore keeps only messages dated between 1998-01-01 and 2002-12-31.
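
A result set like the one described above could be produced by inverting the date filter used in the main query. The sketch below is only an illustration: the helper name `find_implausible_dates` is hypothetical, the default date window is borrowed from the main query, and the rows are made-up stand-ins loaded into an in-memory SQLite table rather than real Enron data. It assumes only the `message` table with the `mid` and `date` columns that the main query already uses.

```python
import sqlite3


def find_implausible_dates(con, start="1998-01-01", end="2002-12-31"):
    """Return (date_sent, message_count) rows whose date falls outside [start, end]."""
    sql = """
        SELECT date(date) AS date_sent, count(mid) AS message_count
            FROM message
            WHERE date(date) NOT BETWEEN ? AND ?
            GROUP BY date_sent
            ORDER BY date_sent;
    """
    return con.execute(sql, (start, end)).fetchall()


# Toy in-memory table with made-up rows, standing in for enron.db.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE message (mid INTEGER PRIMARY KEY, date TEXT)")
demo.executemany(
    "INSERT INTO message (mid, date) VALUES (?, ?)",
    [
        (1, "0001-08-01 10:00:00"),  # illogical year
        (2, "1979-12-31 16:00:00"),  # predates the corporation
        (3, "2001-10-25 09:30:00"),  # plausible, so it is filtered out
        (4, "2044-01-04 14:48:58"),  # far in the future
        (5, "2044-01-04 15:00:00"),
    ],
)

outliers = find_implausible_dates(demo)
for date_sent, message_count in outliers:
    print(date_sent, message_count)
```

To run this against the real data, pass the `con` connection opened on `enron.db` instead of `demo`. The dates compare correctly as text because `date()` returns them in `YYYY-MM-DD` form.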

In [None]:
#@title

# Import the required packages
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3
import os


# Copy the Enron Email database
if not os.path.exists('enron.db'):
  from IPython.display import clear_output 
  !wget -O enron.db https://curtin-my.sharepoint.com/:u:/g/personal/211934g_curtin_edu_au/EaYagsqa2r1Bi5wtHbswGFwBH2kd2uTnz6rlka7GI36GUQ?download=1
  clear_output()
  print('Copied Enron database')

 
# connect to the database
con = sqlite3.connect('enron.db')
 
# create a variable to store the query
SQL = '''
    SELECT date(date) AS date_sent, count(mid) AS message_count
        FROM message
        WHERE date_sent BETWEEN '1998-01-01' AND '2002-12-31'
        GROUP BY date_sent
        ORDER BY date_sent;
'''
 
# Load the query into a dataframe
df = pd.read_sql(SQL, con)

# Plot the number of messages sent per day
df.plot(x = 'date_sent', 
        y = 'message_count',
        title = 'Daily Message Count',
        xlabel = 'Date',
        ylabel = 'Message Count',
        figsize = (20,10)  # (width, height) in inches; wide suits a time series
        )

# Rotate the x-axis tick labels so the dates stay readable
plt.xticks(rotation=40, horizontalalignment="center")

# Discussions

The line graph reveals several significant peaks in Enron's e-mail traffic. The largest peaks and heaviest traffic occurred in October and November of 2001, when the scandal broke. Two smaller peaks occurred around June 26-27, 2001 and December 12-13, 2000, when other newsworthy events involving Enron transpired (one involving the California energy crisis and another involving a leadership change at the company).
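
Peaks read off a chart can be cross-checked by asking the database for the busiest days directly. The following is a minimal sketch under stated assumptions: `busiest_days` is a hypothetical helper, the date window matches the main query, and the counts in the in-memory table are invented purely to make the example runnable (they are not the real Enron figures).

```python
import sqlite3


def busiest_days(con, n=3, start="1998-01-01", end="2002-12-31"):
    """Return the n days with the most messages as (date_sent, message_count)."""
    sql = """
        SELECT date(date) AS date_sent, count(mid) AS message_count
            FROM message
            WHERE date(date) BETWEEN ? AND ?
            GROUP BY date_sent
            ORDER BY message_count DESC, date_sent
            LIMIT ?;
    """
    return con.execute(sql, (start, end, n)).fetchall()


# Toy in-memory table with invented counts, standing in for enron.db.
peaks_demo = sqlite3.connect(":memory:")
peaks_demo.execute("CREATE TABLE message (mid INTEGER PRIMARY KEY, date TEXT)")
made_up_counts = {
    "2000-12-13": 2,
    "2001-03-01": 1,
    "2001-06-27": 3,
    "2001-10-25": 5,
}
for day, count in made_up_counts.items():
    peaks_demo.executemany(
        "INSERT INTO message (date) VALUES (?)",
        [(f"{day} 09:00:00",)] * count,
    )

top_days = busiest_days(peaks_demo, n=3)
for date_sent, message_count in top_days:
    print(date_sent, message_count)
```

Passing the notebook's `con` connection instead of `peaks_demo` would list the busiest days in the real dataset, which can then be compared against the peaks visible in the plot.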

