## Assignment - Single Machine Parallelization

In this assignment we will investigate the Enron email corpus from the 2002 Federal Energy Regulatory Commission (FERC) investigation. This corpus is still used to train AI on topics related to corporate communication, because it is one of the few public datasets of its kind. Using the data for language purposes without proper precautions does raise AI ethics questions, though. [Read this article](https://qz.com/work/1546565/the-emails-that-brought-down-enron-still-shape-our-daily-lives/) to learn more about the dataset and its implications for AI.

Start off by choosing the `ml.m5.xlarge` instance as your kernel. Then start conducting the lab.

---

### Task 0: Download dataset and explore it

The research dataset is maintained by Carnegie Mellon University at https://www.cs.cmu.edu/~enron/. Because of its size, we have created a smaller version on S3. Follow these steps to get the file:

1. Copy the file from the class's public S3 bucket into your own S3 bucket. You must change the bucket name to yours with your net id. `aws s3 cp s3://bigdatateaching/enron/maildir.zip s3://[NET_ID]-labdata`

In [2]:
!aws s3 cp s3://bigdatateaching/enron/maildir.zip s3://anly502-fall-2022-yl1353

copy: s3://bigdatateaching/enron/maildir.zip to s3://anly502-fall-2022-yl1353/maildir.zip


2. Copy the file from your own S3 bucket to SageMaker. Use the boto3 package for this. This will place `maildir.zip` in the `/tmp` folder of your EC2 instance, ready to be unzipped into the `maildir` directory in the next step.

In [3]:
import boto3
my_bucket = 'anly502-fall-2022-yl1353'
my_file = 'maildir.zip'

s3client = boto3.client('s3')
s3client.download_file(my_bucket, my_file, '/tmp/maildir.zip')

3. Run the following command in a new cell to unzip the directory. This could take a while because there are almost 80,000 files! We are explicitly saving to the `/tmp` file share because it is part of your EC2 instance. If you stored the maildir zip or the unpacked files in your git repository instead, every git operation would become very slow.

In [4]:
!cd /tmp; unzip -q /tmp/maildir.zip

The bullets below represent the file structure of interest from the Enron dataset. To confirm that you have unzipped and placed the data in the correct location, click on the links of files, e.g. `1_`, `2_`, etc., which will open the text files for you to explore. Notice that these are all raw text files.

* [maildir](./maildir)
 * [allen-p](./maildir/allen-p)
   * [inbox](./maildir/allen-p/inbox)
       * [1_](./maildir/allen-p/inbox/1_)
       * [2_](./maildir/allen-p/inbox/2_)
       * [3_](./maildir/allen-p/inbox/3_)
       ...
   * [sent_items](./maildir/allen-p/sent_items)
       * [1_](./maildir/allen-p/sent_items/1_)
       * [2_](./maildir/allen-p/sent_items/2_)
       * [3_](./maildir/allen-p/sent_items/3_)
       * etc...
   * etc...
 * [arnold-j](./maildir/arnold-j)
   * [inbox](./maildir/arnold-j/inbox)
       * [1_](./maildir/arnold-j/inbox/1_)
       * [2_](./maildir/arnold-j/inbox/2_)
       * [3_](./maildir/arnold-j/inbox/3_)
       * etc ...
   * [sent_items](./maildir/arnold-j/sent_items)
       * [1_](./maildir/arnold-j/sent_items/1_)
       * [2_](./maildir/arnold-j/sent_items/2_)
       * [3_](./maildir/arnold-j/sent_items/3_)
       * etc...
   * etc...
 * [arora-h](./maildir/arora-h)
   * [inbox](./maildir/arora-h/inbox)
       * [1_](./maildir/arora-h/inbox/1_)
       * [2_](./maildir/arora-h/inbox/2_)
       * [3_](./maildir/arora-h/inbox/3_)
       * etc...
   * [sent_items](./maildir/arora-h/sent_items)
       * [1_](./maildir/arora-h/sent_items/1_)
       * [2_](./maildir/arora-h/sent_items/2_)
       * [3_](./maildir/arora-h/sent_items/3_)
       * etc...
   * etc...
 * etc... etc...

---------------------------

### Task #1: Collect all of the file paths of emails in the maildir (2 points)

The output object will be called `list_email_paths`. It should contain 79,429 items and look similar to the following:
```
['/tmp/maildir/allen-p/inbox/10_',
 '/tmp/maildir/allen-p/inbox/11_',
 '/tmp/maildir/allen-p/inbox/12_',
 '/tmp/maildir/allen-p/inbox/13_',
 '/tmp/maildir/allen-p/inbox/14_',
 '/tmp/maildir/allen-p/inbox/15_',
 '/tmp/maildir/allen-p/inbox/16_',
 '/tmp/maildir/allen-p/inbox/17_',
 '/tmp/maildir/allen-p/inbox/18_',
 '/tmp/maildir/allen-p/inbox/19_',
 ...
 ]
```

The broad strokes of creating this list are as follows:

- use the command `os.listdir('/tmp/maildir')` to see the folders in the mail directory
- then use the command `os.listdir('/tmp/maildir/allen-p')` to see the folders in Allen P's folder
- then use the command `os.listdir('/tmp/maildir/allen-p/inbox')` to see the contents in Allen P's inbox and collect the full file paths
- you need to also collect all the files in both the inbox and sent_items folders for Allen P, do not track emails from the other folders
- next move onto Arnold J!
- next move onto Arora H!
- ....

Light bulbs should start going off in your head at this point about ways to solve this problem. This is a repeated process that needs to happen many many times. A for loop is the simplest way to solve a repetitive task. **Use for loops** to collect all 79,429 file paths into a single list.

**Hint:** look up `os.path.join()` or f-strings if you are having trouble creating the appropriate argument for the `os.listdir()` command
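
As a rough illustration of the hint, both techniques produce the same path string on a Linux machine such as the lab instance. The variable names `base`, `user`, and `folder` are placeholders chosen for this sketch, not names the assignment requires.

```python
import os

# Placeholder path pieces, chosen only for illustration
base = '/tmp/maildir'
user = 'allen-p'
folder = 'inbox'

# os.path.join inserts the path separator for you
joined = os.path.join(base, user, folder)

# An f-string builds the same path by hand
formatted = f"{base}/{user}/{folder}"

print(joined)
print(formatted)
```

Either string can then be passed to `os.listdir()`.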

In the following cell(s) produce the list of emails as directed and save the result to the object `list_email_paths`

In [5]:
import os

In [13]:
list_email_paths = []
initialpath = '/tmp/maildir'
for folder in os.listdir(initialpath):
    for file in os.listdir(os.path.join(initialpath,folder)):
        if (file == "inbox" or file == "sent_items"):
            for inner_file in os.listdir(os.path.join(initialpath, folder, file)):
                path = os.path.join(initialpath, folder, file, inner_file)
                list_email_paths.append(path)

In the following cell(s) print the first 10 items of the list as well as the length of the list

In [26]:
list_email_paths[:10]

['/tmp/maildir/allen-p/inbox/.ipynb_checkpoints',
 '/tmp/maildir/allen-p/inbox/10_',
 '/tmp/maildir/allen-p/inbox/11_',
 '/tmp/maildir/allen-p/inbox/12_',
 '/tmp/maildir/allen-p/inbox/13_',
 '/tmp/maildir/allen-p/inbox/14_',
 '/tmp/maildir/allen-p/inbox/15_',
 '/tmp/maildir/allen-p/inbox/16_',
 '/tmp/maildir/allen-p/inbox/17_',
 '/tmp/maildir/allen-p/inbox/18_']

In [27]:
len(list_email_paths) #need to remove ".ipynb_checkpoints"

79430

In [28]:
list_email_paths[0]

'/tmp/maildir/allen-p/inbox/.ipynb_checkpoints'

In [32]:
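# Build a filtered list instead of popping inside an index loop:
# popping shrinks the list, so later indexes skip items or raise IndexError
string = '.ipynb_checkpoints'
list_email_paths = [p for p in list_email_paths if string not in p]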

In [33]:
#check again
for path in list_email_paths:
    if '.ipynb' in path:
        print(path)

In [34]:
len(list_email_paths)

79429

--------------------

### Task 2: Read in one text email and parse out relevant information using a function (2 points)

Before you get to parallelizing, you need to build the function that will execute on every email in your list. The objective of this task is to build the function called `email_process()` which will take as an input the path of an email and output the following information in a dictionary:

* Email of the sender
* Email(s) of the recipient
* Email timestamp
* Email subject
* Email body

An example output from the function on the first email in Allen P's inbox looks like the following:

```
{'from': 'heather.dunton@enron.com',
 'to': 'k..allen@enron.com',
 'timestamp': 'Fri, 7 Dec 2001 10:06:42 -0800 (PST)',
 'subject': 'RE: West Position',
 'body': ' \nPlease let me know if you still need Curve Shift.\n\nThanks,\nHeather\n -----Original Message-----\nFrom: \tAllen, Phillip K.  \nSent:\tFriday, December 07, 2001 5:14 AM\nTo:\tDunton, Heather\nSubject:\tRE: West Position\n\nHeather,\n\nDid you attach the file to this email?\n\n -----Original Message-----\nFrom: \tDunton, Heather  \nSent:\tWednesday, December 05, 2001 1:43 PM\nTo:\tAllen, Phillip K.; Belden, Tim\nSubject:\tFW: West Position\n\nAttached is the Delta position for 1/16, 1/30, 6/19, 7/13, 9/21\n\n\n -----Original Message-----\nFrom: \tAllen, Phillip K.  \nSent:\tWednesday, December 05, 2001 6:41 AM\nTo:\tDunton, Heather\nSubject:\tRE: West Position\n\nHeather,\n\nThis is exactly what we need.  Would it possible to add the prior day for each of the dates below to the pivot table.  In order to validate the curve shift on the dates below we also need the prior days ending positions.\n\nThank you,\n\nPhillip Allen\n\n -----Original Message-----\nFrom: \tDunton, Heather  \nSent:\tTuesday, December 04, 2001 3:12 PM\nTo:\tBelden, Tim; Allen, Phillip K.\nCc:\tDriscoll, Michael M.\nSubject:\tWest Position\n\n\nAttached is the Delta position for 1/18, 1/31, 6/20, 7/16, 9/24\n\n\n\n << File: west_delta_pos.xls >> \n\nLet me know if you have any questions.\n\n\nHeather'
 }
```

To process the raw text file, we recommend that you use the `email.parser.Parser` class. Read the documentation on it [here](https://docs.python.org/3/library/email.parser.html).
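
A minimal sketch of how the parser behaves, using a made-up message string instead of a file from the corpus. The addresses and the variable names (`raw`, `msg`, and so on) are invented for illustration.

```python
from email.parser import Parser

# A tiny made-up message: headers, a blank line, then the body
raw = (
    "From: sender@example.com\n"
    "To: first@example.com, second@example.com\n"
    "Date: Mon, 1 Jan 2001 09:00:00 -0800 (PST)\n"
    "Subject: Hello\n"
    "\n"
    "This is the body.\n"
)

msg = Parser().parsestr(raw)

# Headers are read like dictionary keys (lookups are case-insensitive)
sender = msg['From']
subject = msg['subject']
# A header that is absent returns None instead of raising
cc = msg['Cc']
# For a non-multipart message, get_payload() returns the body as a string
body = msg.get_payload()

print(sender, subject, cc, repr(body))
```

Because a missing header yields `None`, downstream code should be ready for `None` values in any of the header fields.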

The to field will need some cleaning. Remove any carriage returns, tabs, or spaces from the to field. 
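
One possible way to do this cleaning is a small regular-expression helper. `clean_to_field` is a hypothetical name introduced for this sketch, and the sample address string is made up.

```python
import re

def clean_to_field(to_value):
    """Strip all whitespace from a raw To header (hypothetical helper)."""
    if to_value is None:  # some emails have no To header at all
        return None
    # \s matches spaces, tabs, carriage returns, and newlines
    return re.sub(r'\s+', '', to_value)

# Made-up header value with the kind of line folding found in raw emails
raw_to = 'first@example.com, \r\n\tsecond@example.com'
cleaned = clean_to_field(raw_to)
recipients = cleaned.split(',')

print(cleaned)
print(recipients)
```

With the whitespace gone, splitting on commas gives one clean address per recipient.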

It's OK that the email body is really messy right now. A future task is to clean it up.

In the following cell(s) produce the function `email_process()`

In [35]:
from email.parser import Parser

In [50]:
def email_process(filename):
    keywords = ["from", "to", "timestamp", "subject", "body"]
    # create a dictionary to save outputs later
    email_dict = dict.fromkeys(keywords)

    # create a parsing object to process the raw text file
    parser = Parser()
    # latin-1 maps every byte to a character, so reading never fails on odd encodings
    with open(filename, encoding='latin-1') as f:
        content = f.read()
    raw_text = parser.parsestr(content)

    email_dict['from'] = raw_text['From']
    email_dict['to'] = raw_text['To']
    email_dict['timestamp'] = raw_text['Date']
    email_dict['subject'] = raw_text['Subject']
    email_dict['body'] = raw_text.get_payload()

    # no explicit f.close() is needed: the with block already closed the file
    return email_dict

In the following cell, save a new object called `email_1` as the output for the file `maildir/allen-p/inbox/1_`

In [51]:
email_1 = email_process('/tmp/maildir/allen-p/inbox/1_')

In [52]:
email_1

{'from': 'heather.dunton@enron.com',
 'to': 'k..allen@enron.com',
 'timestamp': 'Fri, 7 Dec 2001 10:06:42 -0800 (PST)',
 'subject': 'RE: West Position',
 'body': ' \nPlease let me know if you still need Curve Shift.\n\nThanks,\nHeather\n -----Original Message-----\nFrom: \tAllen, Phillip K.  \nSent:\tFriday, December 07, 2001 5:14 AM\nTo:\tDunton, Heather\nSubject:\tRE: West Position\n\nHeather,\n\nDid you attach the file to this email?\n\n -----Original Message-----\nFrom: \tDunton, Heather  \nSent:\tWednesday, December 05, 2001 1:43 PM\nTo:\tAllen, Phillip K.; Belden, Tim\nSubject:\tFW: West Position\n\nAttached is the Delta position for 1/16, 1/30, 6/19, 7/13, 9/21\n\n\n -----Original Message-----\nFrom: \tAllen, Phillip K.  \nSent:\tWednesday, December 05, 2001 6:41 AM\nTo:\tDunton, Heather\nSubject:\tRE: West Position\n\nHeather,\n\nThis is exactly what we need.  Would it possible to add the prior day for each of the dates below to the pivot table.  In order to validate the curve 

In [57]:
list(email_1.items())[1]

('to', 'k..allen@enron.com')

----------------

### Task 3: Read in all emails using multiprocessing (3 points)

There are two ways to implement multiprocessing on a list of data:

1. **The vanilla approach to parallelization** involves getting a list of the items to process, then using multiple worker processes to complete the items in parallel, like a queue system. The list gets processed by the workers until there are no items left.
2. **The splitting approach to parallelization** involves getting a list of items to process, splitting that list into a number of sublists, then sending each sublist to a worker. Each worker then processes its sublist in serial.

For more background on these two ideas, check out this useful [medium post](https://medium.com/idealo-tech-blog/parallelisation-in-python-an-alternative-approach-b2749b49a1e).
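
The two approaches can be sketched on a toy task as follows. To keep the sketch runnable in any environment, it uses `multiprocessing.pool.ThreadPool` as a stand-in: it exposes the same `map()` interface as `Pool` but runs threads, not processes, so it illustrates the structure and not the speedup. The function names and the `split_evenly` helper are made up for this example.

```python
from itertools import chain
from multiprocessing.pool import ThreadPool  # stand-in with the same map() API as Pool

def square(x):
    """Toy per-item task standing in for real work."""
    return x * x

def square_sublist(items):
    """Process one sublist serially, as a single worker would."""
    return [square(x) for x in items]

def split_evenly(items, n):
    """Split items into n contiguous sublists of near-equal size."""
    size, extra = divmod(len(items), n)
    sublists, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        sublists.append(items[start:end])
        start = end
    return sublists

items = list(range(10))
n_workers = 3

with ThreadPool(n_workers) as pool:
    # Vanilla: pass the full list and let the pool distribute the items
    out_vanilla_demo = pool.map(square, items)
    # Splitting: pass one sublist per worker; each is processed serially
    nested = pool.map(square_sublist, split_evenly(items, n_workers))

# Flatten the per-worker results back into one list, preserving order
out_split_demo = list(chain.from_iterable(nested))

print(out_vanilla_demo == out_split_demo)
```

Note that the splitting approach returns a list of lists, so it needs the extra flattening step before the two results can be compared.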

**If you are using a Windows machine (not in this lab)**, you will be unable to save your function in a cell and then run it with multiprocessing. Instead, save two scripts, `mail_functions_vanilla.py` and `mail_functions_split.py`, with the functions needed for each approach. Import the module that you will execute in parallel by calling `import mail_functions_vanilla` and then call a function within that module, just like you would with any other module in Python.

Use the `time` module to track the time taken for each approach. Note that you may have files that are corrupted and you will have to figure out how to handle them gracefully.
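
One way to handle bad files gracefully is to wrap the per-file work in a `try`/`except` and return `None` on failure. This is only a sketch: `safe_process` and `read_text` are hypothetical names, and the paths below are deliberately nonexistent to trigger the error branch.

```python
import time

def read_text(path):
    """Toy processing step: read a file as text."""
    with open(path, encoding='utf-8') as f:
        return f.read()

def safe_process(path):
    """Run read_text(path); return None instead of raising on a bad file."""
    try:
        return read_text(path)
    except (OSError, ValueError) as err:
        # OSError covers missing or unreadable files;
        # ValueError covers decoding problems (UnicodeDecodeError is a subclass)
        print(f"skipping {path}: {err}")
        return None

start = time.time()
results = [safe_process(p) for p in ['/no/such/file_1', '/no/such/file_2']]
elapsed = time.time() - start

# Drop the failures before further analysis
good = [r for r in results if r is not None]

print(f"processed {len(results)} paths, {len(good)} succeeded, {elapsed:.4f}s")
```

Because `safe_process` takes a single argument, the same function can be handed to `pool.map()` without changes.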

In the following cell(s), implement the vanilla approach to parallelization to process all emails using `pool.map()`. Save the output list to an object called `out_vanilla`.

In [48]:
import time
import multiprocessing as mp
from multiprocessing import Pool

In [54]:
start = time.time()

pool = mp.Pool(mp.cpu_count())

t1 = time.time()
out_vanilla = pool.map(email_process, list_email_paths)
t2 = time.time()

pool.close()

end = time.time()

print('total time',end - start)
print('time to set up pool',t1 - start)
print('time to multiprocess',t2 - t1)

total time 6.099192142486572
time to set up pool 0.02265024185180664
time to multiprocess 6.076471328735352


In the following cell(s), implement the splitting approach to parallelization to process all emails using `pool.map()`. Save the output list to an object called `out_split`.

In [59]:
import numpy as np
NUM_WORKERS = mp.cpu_count()
def splitter(list_paths, NUM_WORKERS):
    list_list_paths = []
    for i in np.array_split(list_paths, NUM_WORKERS): 
        list_list_paths.append(list(i))
    return list_list_paths
list_list_path = splitter(list_email_paths, NUM_WORKERS)        

In [60]:
def email_process_sublist(list_paths):
    email_sublist_dict = []
    for i in list_paths:
        email_sublist_dict.append(email_process(i))
    return email_sublist_dict

In [94]:
start = time.time()

pool = mp.Pool(mp.cpu_count())

t1 = time.time()
out_split_sub = pool.map(email_process_sublist, list_list_path)
t2 = time.time()

pool.close()

end = time.time()

print('total time',end - start)
print('time to set up pool',t1 - start)
print('time to multiprocess',t2 - t1)

total time 6.565039396286011
time to set up pool 0.05916404724121094
time to multiprocess 6.505750894546509


In the following cell(s), confirm that the results from each method are the same. If they are not, then figure out how to make them the same.

In [62]:
len(out_vanilla)

79429

In [95]:
# out_split_sub holds one list per worker; flatten them in order
# (works for any number of workers, not just 4)
out_split = [email for sublist in out_split_sub for email in sublist]
len(out_split)

79429

In the following cell(s), convert the list `out_split` to a Pandas DataFrame and save it as `df_emails`

In [72]:
import pandas as pd

In [79]:
df_emails = pd.DataFrame(columns = ['from','to','timestamp','subject','body'],dtype=object)
df_emails.reset_index()

Unnamed: 0,index,from,to,timestamp,subject,body


In [80]:
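# DataFrame.append was removed in pandas 2.0 and is slow row by row;
# build the frame from the list of dicts in one call instead
df_emails = pd.DataFrame(out_split, columns=['from', 'to', 'timestamp', 'subject', 'body'])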

In [81]:
df_emails.head()

Unnamed: 0,from,to,timestamp,subject,body
0,anchordesk_daily@anchordesk.zdlists.com,pallen@enron.com,"Sun, 30 Dec 2001 22:49:42 -0800 (PST)",ANCHORDESK: Hope ahead: What I learned from 20...,\n\n_____________________DAVID COURSEY________...
1,subscriptions@intelligencepress.com,pallen@enron.com,"Sun, 30 Dec 2001 23:42:30 -0800 (PST)","NGI Publications - Monday, December 31st 2001","Dear phillip,\n\n\nThis e-mail is automated no..."
2,prizemachine@feedback.iwon.com,pallen@enron.com,"Mon, 31 Dec 2001 02:24:51 -0800 (PST)","Click. Spin. Chances to Win up to $10,000!","\n[IMAGE] [IMAGE] [IMAGE] [IMAGE] $ 2,500 ..."
3,louise.kitchen@enron.com,"wes.colwell@enron.com, georgeanne.hodges@enron...","Mon, 31 Dec 2001 10:53:43 -0800 (PST)",NETCO,The New Year has arrived and we really to fina...
4,arsystem@mailman.enron.com,k..allen@enron.com,"Mon, 31 Dec 2001 17:18:31 -0800 (PST)",Your Approval is Overdue: Access Request for m...,This request has been pending your approval fo...


-------------

### Task 4: Find the most frequent emailers in the dataset (3 points)

We want to know which people sent or received the most emails in the dataset. Use your Pandas dataframe fields `to` and `from` to count the number of emails where an email address appears in the data.

**You must use a dictionary** to keep track of the counts for each email address. This will involve looping through each row and incrementing the dictionary count for each email address found. For example, suppose we processed rows of the data and found that abc@enron.com sent one email to def@enron.com and ghi@enron.com, sent another email to jkl@enron.com, and received an email from ghi@enron.com. This would result in a dictionary like so:

```
{
 'abc@enron.com' : 3,
 'ghi@enron.com' : 2,
 'def@enron.com' : 1,
 'jkl@enron.com' : 1
}
```

Note that we are counting the email address whether it is the recipient or the sender of the email. It is certainly possible that both sides of the email conversation appear in the dataset, but it is OK to count that twice. The `To` field can contain multiple emails. You have to parse and count each email recipient separately!

This process might require a lot of computing power to process every email! You must use `pool.map()` or another `multiprocessing` module function for the mapping part of the problem. The output will be a list of dictionaries like the example above. You will write a function to execute on a small chunk of the dataframe. After you have the list of dictionaries from each chunk, reduce them into a single dictionary.

**BONUS:** 1 point - implement the same map and reduce operations leveraging pandas vectorized operations and groupby-summarize functions.


**Hint:** break your dataframe into a list of smaller dataframes, and use that list to pass through `pool.map()`
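
The map and reduce steps can be sketched on the small example above. To stay dependency-free, this sketch uses plain dictionaries as rows instead of DataFrame chunks and the built-in `map()` instead of `pool.map()`. `count_chunk` is a hypothetical function name.

```python
from collections import Counter
from functools import reduce

def count_chunk(rows):
    """Map step: count every sender and recipient address in one chunk."""
    counts = Counter()
    for row in rows:
        counts[row['from']] += 1
        # The to field may hold several comma-separated addresses
        for address in row['to'].split(','):
            counts[address.strip()] += 1
    return counts

# Made-up rows mirroring the abc/def/ghi/jkl example above
rows = [
    {'from': 'abc@enron.com', 'to': 'def@enron.com, ghi@enron.com'},
    {'from': 'abc@enron.com', 'to': 'jkl@enron.com'},
    {'from': 'ghi@enron.com', 'to': 'abc@enron.com'},
]

# Two chunks, as if the data had been split for two workers
chunks = [rows[:2], rows[2:]]

# Map: one Counter per chunk (pool.map(count_chunk, chunks) would slot in here)
partial_counts = list(map(count_chunk, chunks))

# Reduce: adding Counters sums the counts key by key
total = reduce(lambda a, b: a + b, partial_counts, Counter())
top = dict(total.most_common(2))

print(dict(total))
print(top)
```

The reduce step is cheap compared with the map step, so it is typically run serially in the parent process.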

Use the following cell(s) to answer this question:

In [87]:
#check data type
#type(df_emails['to'][1])
df_emails['to'] = df_emails['to'].astype(str)

In [88]:
emailadd_list = []

for s in df_emails['from']:
    emailadd_list.append(s)

for r in df_emails['to']: 
    r_list = r.split(",")
    for to_add in r_list:
        emailadd_list.append(to_add)

In [89]:
emailadd_list[:10]

['anchordesk_daily@anchordesk.zdlists.com',
 'subscriptions@intelligencepress.com',
 'prizemachine@feedback.iwon.com',
 'louise.kitchen@enron.com',
 'arsystem@mailman.enron.com',
 'anchordesk_daily@anchordesk.zdlists.com',
 'exclusive_offers@sportsline.com',
 'subscriptions@intelligencepress.com',
 'subscriptions@intelligencepress.com',
 'arsystem@mailman.enron.com']

In [96]:
#remove nan value
emailadd_list = [x for x in emailadd_list if x != 'nan']

In [102]:
## Removing none values
# Filter into a new list: calling remove() while iterating skips elements
emailadd_list = [add for add in emailadd_list if add != 'None']

In [104]:
#parse and count each recipient
from collections import Counter
sorted_emailadd = Counter(emailadd_list).most_common()
sorted_emailadd[:10]

[('jeff.dasovich@enron.com', 2115),
 ('no.address@enron.com', 2054),
 ('d..steffes@enron.com', 1745),
 ('chris.germany@enron.com', 1609),
 ('louise.kitchen@enron.com', 1509),
 ('pete.davis@enron.com', 1448),
 ('j.kaminski@enron.com', 1441),
 ('gerald.nemec@enron.com', 1435),
 ('kimberly.watson@enron.com', 1326),
 ('sara.shackleton@enron.com', 1287)]

In [121]:
#execute a smaller chunk

new_emailadd_list=np.array_split(emailadd_list,1000)

pool = mp.Pool(mp.cpu_count())
out_small = pool.map(Counter, new_emailadd_list)
pool.close()

out_small[6]

Counter({'tgraham@tinsleygroup.com': 1,
         'e.murrell@enron.com': 1,
         'jeff.johnson@enron.com': 1,
         'greg.woulfe@enron.com': 1,
         'hector.mcloughlin@enron.com': 1,
         'jenny.rub@enron.com': 2,
         'david.port@enron.com': 1,
         'shona.wilson@enron.com': 3,
         'mike.jordan@enron.com': 1,
         'tina.spiller@enron.com': 1,
         'm.hall@enron.com': 1,
         'marla.barnard@enron.com': 1,
         'kenneth.thibodeaux@enron.com': 2,
         'christina.valdez@enron.com': 1,
         'rsuperty@yahoo.com': 1,
         'courtney.votaw@enron.com': 1,
         'beth.apollo@enron.com': 1,
         'john.allison@enron.com': 1,
         'w..white@enron.com': 2,
         'mark.pickering@enron.com': 1,
         'c..gossett@enron.com': 1,
         'louise.kitchen@enron.com': 1,
         'jeff.bartlett@enron.com': 1,
         'sally.beck@enron.com': 482,
         'chairman.office@enron.com': 1,
         '40enron@enron.com': 13,
         'jae.b

In [122]:
#data cleaning for top email address
key_add = 'jeff.dasovich@enron.com'
for i in range(0,len(emailadd_list)):
    if key_add in emailadd_list[i]:
        emailadd_list[i] = key_add

In the following cell, save a dictionary with only the 20 email addresses with the most emails to the object `dict_top_addresses_sent`

**Hint:** The top email address should be `'jeff.dasovich@enron.com'` with 2,624 emails (though a slightly different number is OK)

In [123]:
dict_top_addresses_sent = Counter(emailadd_list).most_common()[:20]

In [124]:
dict_top_addresses_sent

[('jeff.dasovich@enron.com', 2624),
 ('no.address@enron.com', 2054),
 ('d..steffes@enron.com', 1745),
 ('chris.germany@enron.com', 1609),
 ('louise.kitchen@enron.com', 1509),
 ('pete.davis@enron.com', 1448),
 ('j.kaminski@enron.com', 1441),
 ('gerald.nemec@enron.com', 1435),
 ('kimberly.watson@enron.com', 1326),
 ('sara.shackleton@enron.com', 1287),
 ('40enron@enron.com', 1247),
 ('lynn.blair@enron.com', 1121),
 ('john.arnold@enron.com', 1085),
 ('marie.heard@enron.com', 1083),
 ('m..presto@enron.com', 940),
 ('rod.hayslett@enron.com', 930),
 ('barry.tycholiz@enron.com', 899),
 ('houston <.ward@enron.com>', 893),
 ('joe.parks@enron.com', 877),
 ('sally.beck@enron.com', 869)]

---------------

### Final Task: Run the following cell so your outputs can be checked for accuracy - this is a requirement

## **Save your analytics results to a json object - then add, commit, and push your notebook and json to GitHub!**

In [125]:
import json
grading_dict = {'len_paths' : len(list_email_paths),
 'email_1' : str(email_1),
 'vanilla_split_match' : str(out_vanilla == out_split),
 'df_emails' : df_emails.head(10).to_string(),
 'dict_top_email_addresses' : str(dict_top_addresses_sent)
 }

# Use a with block so the file is flushed and closed after writing
with open('soln.json', 'w') as fp:
    json.dump(grading_dict, fp)