# Solution Overview

The task asks the program to find emails that match a given string the user passes. The solution should not read all data into memory and should seek to be as efficient as possible in the search.

This solution has two key portions: parsing and indexing all Enron emails by sender, recipient, and unique words per email. This pre-processing will then be applied to all emails in the corpus such that all emails are indexed. The second portion -- the true search function -- then allows a user to search a given word across all email metadata (unique words, sender, and recipient).

## Key caveats

Due to preprocessing time constraints, this solution only works for a single Enron employee (allen-p), but the code is applicable to any employee.

Secondly, this notebook contains **both** portions of the above stated solution (pre-processing and the actual search function). Because of this, the preprocessed text is still unnecessarily in-memory. In a deployed environment, the search function would be separately exported as a Python script, and it would be called in order to search through metadata that are static files. This is clearly denoted in the below comments.

Lastly, a to-do solution portion is noted: this search function has yet to recieve powerful optimization. It loops through all emails. A next step is narrowing search scope down to the email_sender or email_recipient (if the lawyers know this information). In addition, the metadata files should be indexed themselves as trees.



## Index all emails

This portion demonstrates how the email is being pre-processed. Note that a single user is being declared (allen-p), but this could be updated for any given user. Moreover, the file structure assumes it is operating on my local machine -- or that this notebook is opened in the same directory as all the Enron employee folders.

In [1]:
import os
from email.parser import Parser

In [2]:
# list all Enron employees
enron_employees = list()
for x in os.listdir():
    enron_employees.append(x)

In [3]:
# hard code to remove a couple unnecessary folders
enron_employees.remove('.DS_Store')
enron_employees.remove('.ipynb_checkpoints')

In [4]:
enron_employees

['allen-p',
 'arnold-j',
 'arora-h',
 'badeer-r',
 'bailey-s',
 'bass-e',
 'baughman-d',
 'beck-s',
 'benson-r',
 'blair-l',
 'brawner-s',
 'buy-r',
 'campbell-l',
 'carson-m',
 'cash-m',
 'causholli-m',
 'corman-s',
 'crandell-s',
 'cuilla-m',
 'dasovich-j',
 'davis-d',
 'dean-c',
 'delainey-d',
 'derrick-j',
 'dickson-s',
 'donoho-l',
 'donohoe-t',
 'dorland-c',
 'enron-search-solution-updated.ipynb',
 'ermis-f',
 'farmer-d',
 'fischer-m',
 'forney-j',
 'fossum-d',
 'gang-l',
 'gay-r',
 'geaccone-t',
 'germany-c',
 'gilbertsmith-d',
 'giron-d',
 'griffith-j',
 'grigsby-m',
 'guzman-m',
 'haedicke-m',
 'hain-m',
 'harris-s',
 'hayslett-r',
 'heard-m',
 'hendrickson-s',
 'hernandez-j',
 'hodge-j',
 'holst-k',
 'horton-s',
 'hyatt-k',
 'hyvl-d',
 'indexed_emails.csv',
 'jones-t',
 'kaminski-v',
 'kean-s',
 'keavey-p',
 'keiser-k',
 'king-j',
 'kitchen-l',
 'kuykendall-t',
 'lavorato-j',
 'lay-k',
 'lenhart-m',
 'lewis-a',
 'linder-e',
 'lokay-m',
 'lokey-t',
 'love-p',
 'lucci-p',
 'mag

Caveat to note from this point forward: I am **only** indexing the data of one Enron employee. All code below this says `user`, where one could iterate through all employees to create the necessary index of all Enron emails.

In [5]:
# hardcode solution for a single user - first one provided
user = 'allen-p'

In [6]:
# get all user directories -- some users have more folders than others
user_dirs = []
for x in os.listdir(user):
    user_dirs.append(x)

In [7]:
# hard code to remove Mac default
user_dirs.remove('.DS_Store')

In [8]:
# identify relevant directories
user_dirs

['_sent_mail',
 'all_documents',
 'contacts',
 'deleted_items',
 'discussion_threads',
 'inbox',
 'notes_inbox',
 'sent',
 'sent_items',
 'straw']

In [9]:
# function to parse email to, from, body
def email_analyze(inputfile, to_email_list, from_email_list, email_body):
    with open(inputfile, "r") as f:
        data = f.read()
 
    email = Parser().parsestr(data)
    
    if email['to']:
        email_to = email['to']
        email_to = email_to.replace("\n", "")
        email_to = email_to.replace("\t", "")
        email_to = email_to.replace(" ", "")
        
    to_email_list.append(email['to'])
    from_email_list.append(email['from'])
 
    email_body.append(email.get_payload())

In [10]:
# key metadata
to_email_list = []
from_email_list = []
email_body = []
email_id = []

In [11]:
# create to, from, body, and email_id for each email for this user
for folder in user_dirs:
    rootdir = str(user+"/" + folder)
    #print(rootdir)
    for directory, subdirectory, filenames in  os.walk(rootdir):
        for filename in filenames:
            email_analyze(os.path.join(directory, filename), to_email_list, from_email_list, email_body )
            email_id.append(os.path.join(directory, filename))

In [12]:
# create unique set of words for all emails
unique_words = []
for email in email_body:
    temp = email.lower().split()
    set(temp)
    unique_words.append(temp)

## Export indexed email metadata *(UPDATED)*

We've created our email metadata. Now we will export this metadata to a file. This metadata is what we will search on device to identify the relevant `email_id`s of interest.


In [13]:
# create metadata
# key: email_id, value: list of unique words in email body, recipient, and sender
meta_data = dict(zip(email_id, zip(unique_words, to_email_list, from_email_list)))

In [14]:
# eg: the index position (as a file location), unique words in email body, recipient, and sender
meta_data['allen-p/_sent_mail/1001.']

(["let's", 'shoot', 'for', 'tuesday', 'at', '11:45.'],
 'greg.piper@enron.com',
 'phillip.allen@enron.com')

In [15]:
import csv

In [16]:
# write data to csv
w = csv.writer(open("indexed_emails.csv", "w"))
for key, val in meta_data.items():
    w.writerow([key, val])

## Search function *(UPDATED)*

The above code is run **before** the email data goes onto the mobile device. Therefore, what is loaded on device is twofold: (1) the raw, initial data and (2) the metadata CSV produced above.

The below search function then is called across that metadata CSV, and is subject to memory constraint.

In [17]:
# define ability to read a file greater than size of memory - our metadata, in this case
def read_large_file(file_object):
    '''
    A generator function to read the large file lazily
    Key assumption being made here: one row of metadata (one email) does not exceed the size of on-device memory
    '''

    # loop indefinitely until the end of the file
    while True:
        data = file_object.readline()
        
        # break if this is the end of the file
        if not data:
            break
        yield data

In [18]:
# search function - looks at metadata CSV lazily to find matching terms. Returns email_id of matching emails.
def search(search_term, sent_to = None, sent_from = None, ):
    '''
    Pass the search term (word) of interest as lowercase. Matching email search indexes (file locations) are returned.
    TO DO: search by sender and recipient -- order this *FIRST*
    '''
    
    email_match_ids = list()
    
    # open a connection to the metadata
    with open('indexed_emails.csv') as file:
        
        # iterate over the generator from read_large_file()
        for line in read_large_file(file):
            
            # split the row into commas
            row = line.split(',')
            
            # search the row from index position 1 (position 0 is the email_id)
            for word in row[1:]:
                if search_term in word:
                    
                    # append the emaiil id to a list of relevant email_ids
                    email_match_ids.append(row[0])

    # return final list of matching email ids
    return email_match_ids


In [19]:
# write a test query
match_ids = search('here')

In [20]:
# which emails matched?
print(str(len(match_ids)) + " emails match the search query. Their index positions are:")
match_ids

1935 emails match the search query. Their index positions are:


['allen-p/_sent_mail/1.',
 'allen-p/_sent_mail/10.',
 'allen-p/_sent_mail/10.',
 'allen-p/_sent_mail/10.',
 'allen-p/_sent_mail/102.',
 'allen-p/_sent_mail/104.',
 'allen-p/_sent_mail/105.',
 'allen-p/_sent_mail/105.',
 'allen-p/_sent_mail/106.',
 'allen-p/_sent_mail/106.',
 'allen-p/_sent_mail/107.',
 'allen-p/_sent_mail/110.',
 'allen-p/_sent_mail/110.',
 'allen-p/_sent_mail/111.',
 'allen-p/_sent_mail/115.',
 'allen-p/_sent_mail/115.',
 'allen-p/_sent_mail/116.',
 'allen-p/_sent_mail/116.',
 'allen-p/_sent_mail/117.',
 'allen-p/_sent_mail/117.',
 'allen-p/_sent_mail/117.',
 'allen-p/_sent_mail/118.',
 'allen-p/_sent_mail/118.',
 'allen-p/_sent_mail/118.',
 'allen-p/_sent_mail/119.',
 'allen-p/_sent_mail/123.',
 'allen-p/_sent_mail/123.',
 'allen-p/_sent_mail/123.',
 'allen-p/_sent_mail/123.',
 'allen-p/_sent_mail/13.',
 'allen-p/_sent_mail/130.',
 'allen-p/_sent_mail/131.',
 'allen-p/_sent_mail/131.',
 'allen-p/_sent_mail/132.',
 'allen-p/_sent_mail/137.',
 'allen-p/_sent_mail/14.',

In [21]:
# display those emails to the user
for path in match_ids:
    with open(str("./"+ path), "r") as f:
        print(path)
        data = f.read()
        print(data)

allen-p/_sent_mail/1.
Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 
allen-p/_sent_mail/10.
Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>
Date: Fri, 4 May 2001 13:51:00 -0700 (PDT)
From: phillip.allen@enron.com
To: john.lavorato@enron.com
Subject: Re:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: John J Lavorato <John J Lavorato/ENRON@enronXgate@ENRON>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: p

allen-p/all_documents/314.
Message-ID: <8873619.1075855672368.JavaMail.evans@thyme>
Date: Thu, 17 Feb 2000 05:07:00 -0800 (PST)
From: phillip.allen@enron.com
To: maryrichards7@hotmail.com
Subject: Re: February expenses
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: "mary richards" <maryrichards7@hotmail.com> @ ENRON
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\All documents
X-Origin: Allen-P
X-FileName: pallen.nsf

mary,

Are you sure you did the attachment right.  There was no file attached to 
your message.  Please try again.

Phillip
allen-p/all_documents/332.
Message-ID: <26412160.1075855672754.JavaMail.evans@thyme>
Date: Tue, 1 Feb 2000 08:07:00 -0800 (PST)
From: phillip.allen@enron.com
To: julie.gomez@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Julie A Gomez
X-cc: 
X-bcc: 
X-Folder: \Phillip

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Message-ID: <31444334.1075858639371.JavaMail.evans@thyme>
Date: Tue, 29 May 2001 05:04:48 -0700 (PDT)
From: k..allen@enron.com
To: editor@cookingsweeps.com
Subject: RE: 1/2 Price Omaha Steaks Sale!
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
X-From: Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=PALLEN>
X-To: 'editor@cookingsweeps.com@ENRON' <IMCEANOTES-editor+40cookingsweeps+2Ecom+40ENRON@ENRON.com>
X-cc: 
X-bcc: 
X-Folder: \PALLEN (Non-Privileged)\Allen, Phillip K.\Sent Items
X-Origin: Allen-P
X-FileName: PALLEN (Non-Privileged).pst

Please remove from email.

Phillip Allen

 -----Original Message-----
From: =09editor@cookingsweeps.com@ENRON [mailto:IMCEANOTES-editor+40cooking=
sweeps+2Ecom+40ENRON@ENRON.com]=20
Sent:=09Saturday, May 26, 2001 3:45 PM
To:=09pallen@enron.com
Subject:=091/2 Price Omaha Steaks Sale!


This message was not sent unsolicited. Your email has been submitted and ve=
rified for opt in promotions. 

**UPDATED**: How long did this take?

In [22]:
import time

In [23]:
t0 = time.time()
match_ids = search('here')
t1 = time.time()

In [24]:
total = t1-t0
print(total)

0.11143994331359863


Recall the above lazy list reading assumes the metadata of a single email will not exceed on-device RAM. If we believe a single email may exceed the size of all RAM, we could chunk the whole dataset and not treat it as a CSV. This requires reconsidering how to index our dataset.

Nonetheless, a hypothetical lazy reading process (subject to 1GB of on-device RAM) is below for posterity.

In [25]:
# def read_in_chunks(file, chunk_size=1024):
#     """
#     Generator to read a file piece by piece. We assume 1GB of memory.
#     """
#     while True:
#         data = file.read(chunk_size)
#         # break once at the end of the file
#         if not data:
#             break
#         yield data

In [26]:
# f = open('indexed_emails.csv')
# for piece in read_in_chunks(f):
#     process_data(piece)
#     break