## Assignment - Single Machine Parallelization

In this assignment we will investigate the Enron email corpus from the 2002 Federal Energy Regulatory Commission (FERC) investigation. This corpus is still used to train AI on topics related to corporate communication, as it is one of the few public datasets of its kind, though using the data for language purposes without proper precautions raises AI ethics questions. [Read this article](https://qz.com/work/1546565/the-emails-that-brought-down-enron-still-shape-our-daily-lives/) to learn more about the dataset and its implications for AI.

Start off by choosing the `ml.m5.xlarge` instance as your kernel. Then start conducting the lab.

---

### Task 0: Download dataset and explore it

The research dataset, maintained by Carnegie Mellon University, is located at https://www.cs.cmu.edu/~enron/. Because of its size, we have created a smaller version on S3. Follow these steps to get the file:

1. Copy the file from the class's public S3 bucket into your own S3 bucket. Replace `[NET_ID]` in the destination bucket name with your own net ID: `aws s3 cp s3://bigdatateaching/enron/maildir.zip s3://[NET_ID]-labdata`

2. Copy the file from your own S3 bucket to SageMaker. Use the boto3 package for this. This will place `maildir.zip` in the `/tmp` folder of your EC2 instance.

In [6]:
import boto3
my_bucket = 'mc2582-labdata'
my_file = 'maildir.zip'

s3client = boto3.client('s3')
s3client.download_file(my_bucket, my_file, '/tmp/maildir.zip')

3. Run the following command in a new cell to unzip the directory. This could take a while, since there are almost 80,000 files! We are explicitly saving to the `/tmp` file share because it is part of your EC2 instance. If you used your git repository to store the maildir zip or unpacked files, we would be here a while...

In [7]:
!cd /tmp; unzip -q /tmp/maildir.zip

The bullets below represent the file structure of interest from the Enron dataset. To confirm that you have unzipped and placed the data in the correct location, click on the links of files, e.g. `1_`, `2_`, etc., which will open the text files for you to explore. Notice that these are all raw text files.

* [maildir](./maildir)
 * [allen-p](./maildir/allen-p)
   * [inbox](./maildir/allen-p/inbox)
       * [1_](./maildir/allen-p/inbox/1_)
       * [2_](./maildir/allen-p/inbox/2_)
       * [3_](./maildir/allen-p/inbox/3_)
       ...
   * [sent_items](./maildir/allen-p/sent_items)
       * [1_](./maildir/allen-p/sent_items/1_)
       * [2_](./maildir/allen-p/sent_items/2_)
       * [3_](./maildir/allen-p/sent_items/3_)
       * etc...
   * etc...
 * [arnold-j](./maildir/arnold-j)
   * [inbox](./maildir/arnold-j/inbox)
       * [1_](./maildir/arnold-j/inbox/1_)
       * [2_](./maildir/arnold-j/inbox/2_)
       * [3_](./maildir/arnold-j/inbox/3_)
       * etc ...
   * [sent_items](./maildir/arnold-j/sent_items)
       * [1_](./maildir/arnold-j/sent_items/1_)
       * [2_](./maildir/arnold-j/sent_items/2_)
       * [3_](./maildir/arnold-j/sent_items/3_)
       * etc...
   * etc...
 * [arora-h](./maildir/arora-h)
   * [inbox](./maildir/arora-h/inbox)
       * [1_](./maildir/arora-h/inbox/1_)
       * [2_](./maildir/arora-h/inbox/2_)
       * [3_](./maildir/arora-h/inbox/3_)
       * etc...
   * [sent_items](./maildir/arora-h/sent_items)
       * [1_](./maildir/arora-h/sent_items/1_)
       * [2_](./maildir/arora-h/sent_items/2_)
       * [3_](./maildir/arora-h/sent_items/3_)
       * etc...
   * etc...
 * etc... etc...

---------------------------

### Task #1: Collect all of the file paths of emails in the maildir (2 points)

The output object will be called `list_email_paths`. It should contain 79,429 items and look similar to the following:
```
['/tmp/maildir/allen-p/inbox/10_',
 '/tmp/maildir/allen-p/inbox/11_',
 '/tmp/maildir/allen-p/inbox/12_',
 '/tmp/maildir/allen-p/inbox/13_',
 '/tmp/maildir/allen-p/inbox/14_',
 '/tmp/maildir/allen-p/inbox/15_',
 '/tmp/maildir/allen-p/inbox/16_',
 '/tmp/maildir/allen-p/inbox/17_',
 '/tmp/maildir/allen-p/inbox/18_',
 '/tmp/maildir/allen-p/inbox/19_',
 ...
 ]
```

The broad strokes of creating this list are as follows:

- use the command `os.listdir('/tmp/maildir')` to see the folders in the mail directory
- then use the command `os.listdir('/tmp/maildir/allen-p')` to see the folders in Allen P's folder
- then use the command `os.listdir('/tmp/maildir/allen-p/inbox')` to see the contents of Allen P's inbox and collect the full file paths
- collect the files in both the `inbox` and `sent_items` folders for Allen P, and do not track emails from the other folders
- next move onto Arnold J!
- next move onto Arora H!
- ....

Light bulbs should start going off in your head at this point about ways to solve this problem. This is a repeated process that needs to happen many many times. A for loop is the simplest way to solve a repetitive task. **Use for loops** to collect all 79,429 file paths into a single list.

**Hint:** look up `os.path.join()` or f-strings if you are having trouble creating the appropriate argument for the `os.listdir()` command
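As a small illustration of the hint, the sketch below builds a tiny stand-in directory tree in a temporary folder (the member names, folder names, and file names are made up for the demo) and uses `os.path.join()` to assemble every path passed to `os.listdir()`. It is one possible shape for the loop, not the required solution.

```python
import os
import tempfile

# Build a tiny, hypothetical stand-in for /tmp/maildir so the sketch runs anywhere.
root = os.path.join(tempfile.mkdtemp(), "maildir")
for member, folder, name in [
    ("allen-p", "inbox", "1_"),
    ("allen-p", "sent_items", "1_"),
    ("allen-p", "deleted_items", "1_"),  # a folder we do not want to track
    ("arnold-j", "inbox", "2_"),
]:
    os.makedirs(os.path.join(root, member, folder), exist_ok=True)
    with open(os.path.join(root, member, folder, name), "w") as handle:
        handle.write("placeholder")

demo_paths = []
for member in sorted(os.listdir(root)):
    for folder in ("inbox", "sent_items"):
        # os.path.join inserts the path separators for us
        folder_path = os.path.join(root, member, folder)
        if not os.path.isdir(folder_path):
            continue  # not every member has both folders
        for name in sorted(os.listdir(folder_path)):
            demo_paths.append(os.path.join(folder_path, name))

print(len(demo_paths))
```

The same paths could be built with f-strings, e.g. `f"{root}/{member}/{folder}"`; `os.path.join()` simply avoids having to manage the separators by hand.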

In the following cell(s) produce the list of emails as directed and save the result to the object `list_email_paths`

In [27]:
import os
list_email_paths = []
for member in os.listdir('/tmp/maildir'):
    if os.path.exists('/tmp/maildir/'+member+'/inbox'):
        for number in os.listdir('/tmp/maildir/'+member+'/inbox'):
            if number[-1]=='_':
                list_email_paths.append('/tmp/maildir/'+member+'/inbox/'+number)
    if os.path.exists('/tmp/maildir/'+member+'/sent_items'):
        for number in os.listdir('/tmp/maildir/'+member+'/sent_items'):
            if number[-1]=='_':
                list_email_paths.append('/tmp/maildir/'+member+'/sent_items/'+number)

In the following cell(s) print the first 10 items of the list as well as the length of the list

In [29]:
for i in range(10):
    print(list_email_paths[i])

/tmp/maildir/allen-p/inbox/10_
/tmp/maildir/allen-p/inbox/11_
/tmp/maildir/allen-p/inbox/12_
/tmp/maildir/allen-p/inbox/13_
/tmp/maildir/allen-p/inbox/14_
/tmp/maildir/allen-p/inbox/15_
/tmp/maildir/allen-p/inbox/16_
/tmp/maildir/allen-p/inbox/17_
/tmp/maildir/allen-p/inbox/18_
/tmp/maildir/allen-p/inbox/19_


In [30]:
len(list_email_paths)

79429

--------------------

### Task 2: Read in one text email and parse out relevant information using a function (2 points)

Before you get to parallelizing, you need to build the function that will execute on every email in your list. The objective of this task is to build the function called `email_process()` which will take as an input the path of an email and output the following information in a dictionary:

* Email of the sender
* Email(s) of the recipient
* Email timestamp
* Email subject
* Email body

An example output from the function on the first email in Allen P's inbox looks like the following:

```
{'from': 'heather.dunton@enron.com',
 'to': 'k..allen@enron.com',
 'timestamp': 'Fri, 7 Dec 2001 10:06:42 -0800 (PST)',
 'subject': 'RE: West Position',
 'body': ' \nPlease let me know if you still need Curve Shift.\n\nThanks,\nHeather\n -----Original Message-----\nFrom: \tAllen, Phillip K.  \nSent:\tFriday, December 07, 2001 5:14 AM\nTo:\tDunton, Heather\nSubject:\tRE: West Position\n\nHeather,\n\nDid you attach the file to this email?\n\n -----Original Message-----\nFrom: \tDunton, Heather  \nSent:\tWednesday, December 05, 2001 1:43 PM\nTo:\tAllen, Phillip K.; Belden, Tim\nSubject:\tFW: West Position\n\nAttached is the Delta position for 1/16, 1/30, 6/19, 7/13, 9/21\n\n\n -----Original Message-----\nFrom: \tAllen, Phillip K.  \nSent:\tWednesday, December 05, 2001 6:41 AM\nTo:\tDunton, Heather\nSubject:\tRE: West Position\n\nHeather,\n\nThis is exactly what we need.  Would it possible to add the prior day for each of the dates below to the pivot table.  In order to validate the curve shift on the dates below we also need the prior days ending positions.\n\nThank you,\n\nPhillip Allen\n\n -----Original Message-----\nFrom: \tDunton, Heather  \nSent:\tTuesday, December 04, 2001 3:12 PM\nTo:\tBelden, Tim; Allen, Phillip K.\nCc:\tDriscoll, Michael M.\nSubject:\tWest Position\n\n\nAttached is the Delta position for 1/18, 1/31, 6/20, 7/16, 9/24\n\n\n\n << File: west_delta_pos.xls >> \n\nLet me know if you have any questions.\n\n\nHeather'
 }
```

To process the raw text file, we recommend that you use the `email.parser.Parser()` class. Read the documentation on it [here](https://docs.python.org/3/library/email.parser.html).
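To get a feel for the parser before pointing it at files, here is a minimal sketch that parses a short message held in a string. The message text is a shortened, hand-written stand-in based on the example above, and `parsestr()` is used instead of `parse()` only because the input is a string rather than an open file.

```python
from email.parser import Parser

# A hand-written stand-in for the contents of one raw email file.
raw_message = (
    "From: heather.dunton@enron.com\n"
    "To: k..allen@enron.com\n"
    "Date: Fri, 7 Dec 2001 10:06:42 -0800 (PST)\n"
    "Subject: RE: West Position\n"
    "\n"
    "Please let me know if you still need Curve Shift.\n"
)

message = Parser().parsestr(raw_message)

# Headers are read like dictionary keys; a missing header returns None.
parsed = {
    "from": message["From"],
    "to": message["To"],
    "timestamp": message["Date"],
    "subject": message["Subject"],
    "body": message.get_payload(),
}
print(parsed["subject"])
print(message["Cc"])
```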

The to field will need some cleaning. Remove any carriage returns, tabs, or spaces from the to field. 
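One way to do this cleaning, sketched under the assumption that the field arrives as a single string (or `None` when the header is missing), is a regular expression that removes all whitespace. The helper name `clean_recipients` and the sample addresses are hypothetical.

```python
import re

def clean_recipients(raw_to):
    """Remove carriage returns, newlines, tabs, and spaces from a To field."""
    if raw_to is None:
        return None  # some emails have no To header at all
    return re.sub(r"\s+", "", raw_to)

messy = "tim.belden@enron.com,\r\n\tk..allen@enron.com, someone@enron.com"
print(clean_recipients(messy))
```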

It's OK that the email body is really messy right now. A future task is to clean it up.

In the following cell(s) produce the function `email_process()`

In [90]:
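import email.parser

def email_process(path):
    """Parse one raw email file into a dict of sender, recipient(s), timestamp, subject, and body."""
    try:
        # errors='ignore' drops undecodable bytes instead of raising on them
        with open(path, encoding='utf-8', errors='ignore') as fp:
            ema = email.parser.Parser().parse(fp)
        return {'from': ema['From'], 'to': ema['To'], 'timestamp': ema['Date'],
                'subject': ema['Subject'], 'body': ema.get_payload()}
    except Exception:
        # Report the unreadable file; the function then returns None for it
        print(path)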


In the following cell, save a new object called `email_1` as the output for the file `maildir/allen-p/inbox/1_`

In [91]:
email_1 = email_process('/tmp/maildir/allen-p/inbox/1_')

----------------

### Task 3: Read in all emails using multiprocessing (3 points)

There are two ways to implement multiprocessing on a list of data:

1. **The vanilla approach to parallelization** involves getting a list of the items to process, then using multiple worker processes to complete all the items in parallel, like a queue system. The list gets processed by workers until there are no items left.
2. **The splitting approach to parallelization** involves getting a list of items to process, splitting that list into a number of sublists, then sending each sublist to a worker. Each worker then processes its sublist serially.

For more background on these two ideas, check out this useful [medium post](https://medium.com/idealo-tech-blog/parallelisation-in-python-an-alternative-approach-b2749b49a1e).
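The difference between the two approaches can be seen on a toy problem. The sketch below is an illustration only: it squares numbers instead of parsing emails, and it uses `multiprocessing.pool.ThreadPool` as a stand-in because it exposes the same `map()` interface as `mp.Pool` while running in any environment. For real CPU parallelism you would swap in `mp.Pool`. The helper names are hypothetical.

```python
from multiprocessing.pool import ThreadPool

def square(value):
    """Stand-in for the per-item work (e.g. processing one email)."""
    return value * value

def process_sublist(values):
    """Splitting approach: one worker handles a whole sublist serially."""
    return [square(v) for v in values]

def split_evenly(items, n_parts):
    """Cut a list into n_parts contiguous sublists of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts

items = list(range(10))
n_workers = 3

with ThreadPool(n_workers) as pool:
    # Vanilla: hand the pool the full list of items.
    out_vanilla_demo = pool.map(square, items)
    # Splitting: hand the pool one sublist per worker, then flatten.
    nested = pool.map(process_sublist, split_evenly(items, n_workers))

out_split_demo = [x for sublist in nested for x in sublist]
print(out_vanilla_demo == out_split_demo)
```

Because `map()` returns results in input order and the sublists are contiguous, flattening the nested result reproduces the vanilla ordering.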

**If you are using a Windows machine (not in this lab)**, you will be unable to save your function in a cell and then run it with multiprocessing. Instead, save two scripts, `mail_functions_vanilla.py` and `mail_functions_split.py`, with the functions needed for each approach. Import the module that you will execute in parallel by calling `import mail_functions_vanilla` and then calling a function within that module, just like you would with any other module in Python.

Use the `time` module to track the time taken for each approach. Note that you may have files that are corrupted and you will have to figure out how to handle them gracefully.
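A minimal sketch of both ideas, timing a call and surviving bad inputs, is shown below. The helper names `timed` and `safe_call` are hypothetical, and `int()` on strings stands in for processing a file that may be corrupted.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def safe_call(func, item):
    """Return func(item), or None if the item cannot be processed."""
    try:
        return func(item)
    except (OSError, ValueError):
        return None  # a placeholder keeps the output aligned with the input

def process_all(items):
    return [safe_call(int, item) for item in items]

results, seconds = timed(process_all, ["1", "2", "oops"])
print(results)
print(f"took {seconds:.6f} seconds")
```

Returning a placeholder rather than skipping the item keeps the output list the same length as the input list, which makes the two parallelization approaches easier to compare.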

In the following cell(s), implement the vanilla approach to parallelization to process all emails using `pool.map()`. Save the output list to an object called `out_vanilla`.

In [92]:
import multiprocessing as mp
import time
start = time.time()

pool = mp.Pool(mp.cpu_count())
t1 = time.time()
out_vanilla = pool.map(email_process , list_email_paths)

t2 = time.time()
pool.close()
end = time.time()

print('total time',end - start)
print('time to set up pool',t1 - start)
print('time to multiprocess',t2 - t1)

total time 13.539299011230469
time to set up pool 0.08559441566467285
time to multiprocess 13.45361876487732


In the following cell(s), implement the splitting approach to parallelization to process all emails using `pool.map()`. Save the output list to an object called `out_split`.

In [100]:
import numpy as np
def actual_task(sublist_of_paths):
    sublist_results = []
    for path in sublist_of_paths:
        sublist_results.append(email_process(path))
    return sublist_results

def splitter(list_ids, NUM_WORKERS):
    list_list_ids = []
    for i in np.array_split(list_ids, NUM_WORKERS): 
        list_list_ids.append(list(i))
    return list_list_ids

start = time.time()
ids = splitter(list_email_paths, mp.cpu_count())

pool = mp.Pool(mp.cpu_count()) 
t1 = time.time()
res_list = pool.map(actual_task, ids)
t2 = time.time()
pool.close()
out_split = [item for sublist in res_list for item in sublist]
end = time.time()

print('total time', end - start)
print('time to set up pool',t1 - start)
print('time to multiprocess',t2 - t1)

total time 14.073877573013306
time to set up pool 0.14962458610534668
time to multiprocess 13.919296503067017


In the following cell(s), confirm that the results from each method are the same. If they are not, then figure out how to make them the same.

In [101]:
out_vanilla == out_split

True

In the following cell(s), convert the list `out_split` to a Pandas DataFrame and save it as `df_emails`

In [103]:
import pandas as pd
df_emails = pd.DataFrame(out_split)

-------------

### Task 4: Find the most frequent emailers in the dataset (3 points)

We want to know which people sent or received the most emails in the dataset. Use your Pandas dataframe fields `to` and `from` to count the number of emails where an email address appears in the data.

**You must use a dictionary** to keep track of how many emails each address appears in. This will involve looping through each row and incrementing the dictionary count for each email address found. For example, suppose we processed three rows of the data and found that abc@enron.com sent one email to def@enron.com and ghi@enron.com, sent another email to jkl@enron.com, and received an email from ghi@enron.com. This would result in a dictionary like so:

```
{
 'abc@enron.com' : 3,
 'ghi@enron.com' : 2,
 'def@enron.com' : 1,
 'jkl@enron.com' : 1
}
```

Note that we are counting the email address whether it is the recipient or the sender of the email. It is certainly possible that both sides of an email conversation appear in the dataset, but it is OK to count that twice. The `To` field can contain multiple email addresses. You have to parse and count each recipient separately!

This process might require a lot of computing power to process every email! You must use `pool.map()` or another `multiprocessing` module function for the mapping part of the problem. The output will be a list of dictionaries like the example above. You will write a function to execute on a small chunk of the dataframe. After you have the list of dictionaries from each chunk, reduce them into a single dictionary.
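The map and reduce steps can be sketched on the three-email example above. This is an illustration under stated assumptions: plain dictionaries stand in for dataframe rows, the chunk size of 2 is arbitrary, and `ThreadPool` is used as a stand-in with the same `map()` interface as `mp.Pool` so the sketch runs anywhere. The function names are hypothetical.

```python
from collections import Counter
from functools import reduce
from multiprocessing.pool import ThreadPool

rows = [
    {"from": "abc@enron.com", "to": "def@enron.com, ghi@enron.com"},
    {"from": "abc@enron.com", "to": "jkl@enron.com"},
    {"from": "ghi@enron.com", "to": "abc@enron.com"},
]

def count_chunk(chunk):
    """Map step: count every sender and recipient in one chunk of rows."""
    counts = Counter()
    for row in chunk:
        counts[row["from"]] += 1
        if row["to"]:  # the To field may be missing
            for address in row["to"].split(","):
                counts[address.strip()] += 1
    return dict(counts)

def merge_counts(left, right):
    """Reduce step: add two count dictionaries together."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds counts rather than replacing
    return dict(merged)

chunks = [rows[i:i + 2] for i in range(0, len(rows), 2)]
with ThreadPool(2) as pool:
    partial_counts = pool.map(count_chunk, chunks)

total_counts = reduce(merge_counts, partial_counts, {})
print(total_counts)
```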

**BONUS:** 1 point - implement the same map and reduce operations leveraging pandas vectorized operations and groupby-summarize functions.
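For the vectorized route, one possible sketch (assuming pandas is installed and that recipients in `to` are comma-separated) splits and explodes the `to` column, stacks it with the `from` column, and counts the values. The small `df_demo` frame is made up to mirror the three-email example above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "from": ["abc@enron.com", "abc@enron.com", "ghi@enron.com"],
    "to": ["def@enron.com, ghi@enron.com", "jkl@enron.com", "abc@enron.com"],
})

# One row per recipient: split on commas, explode the lists, trim whitespace.
recipients = df_demo["to"].dropna().str.split(",").explode().str.strip()

# Stack senders and recipients into one column and count each address.
all_addresses = pd.concat([df_demo["from"].dropna(), recipients], ignore_index=True)
vectorized_counts = all_addresses.value_counts().to_dict()
print(vectorized_counts)
```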


**Hint:** break your dataframe into a list of smaller dataframes, and use that list to pass through `pool.map()`

Use the following cell(s) to answer this question:

In [143]:
def row_join(x):
    # Combine sender and recipients into one comma-separated string
    if x['to'] is not None:
        return x['from'] + ', ' + x['to']
    else:
        return x['from']
merged_list = df_emails[['from','to']].apply(row_join, axis=1)

In [156]:
from collections import Counter
def count_recipients(email_str):
    temp_Counter = Counter([word.strip() for word in email_str.split(',')])
    return {key: value for key, value in temp_Counter.items()}
temp = count_recipients(merged_list[3])

In [167]:
def dictionary_merge(dict_a, dict_b):
    for key in dict_b:
        if key in dict_a:
            dict_a[key] = dict_a[key] + dict_b[key]
        else:
            dict_a[key] = dict_b[key]
    return dict_a

In [163]:
pool = mp.Pool(mp.cpu_count())
result_dictionaries = pool.map(count_recipients , merged_list)
pool.close()

In [171]:
from functools import reduce
final_dict = reduce(dictionary_merge, result_dictionaries)

In [172]:
sorted_dict = dict(sorted(final_dict.items(), key=lambda item: item[1], reverse=True))

In the following cell, save a dictionary with only the 20 email addresses with the most emails to the object `dict_top_addresses_sent`

**Hint:** The top email address should be `'jeff.dasovich@enron.com'` with 2,624 emails (though a slightly different number is OK)
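As an aside, the standard library's `collections.Counter` offers a compact way to pick the top entries from a dictionary of counts. The sketch below uses made-up counts and keeps the top 2 instead of the top 20.

```python
from collections import Counter

demo_counts = {
    "abc@enron.com": 3,
    "def@enron.com": 1,
    "ghi@enron.com": 2,
    "jkl@enron.com": 1,
}

# most_common(n) returns the n highest-count (key, count) pairs in order.
top_two = dict(Counter(demo_counts).most_common(2))
print(top_two)
```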

In [177]:
dict_top_addresses_sent = {}
for key, value in sorted_dict.items():
    if len(dict_top_addresses_sent) < 20:
        dict_top_addresses_sent[key]=value
    else:
        break

In [180]:
print(dict_top_addresses_sent)

{'jeff.dasovich@enron.com': 2624, 'd..steffes@enron.com': 2402, 'louise.kitchen@enron.com': 2248, 'chris.germany@enron.com': 2136, 'gerald.nemec@enron.com': 2115, 'no.address@enron.com': 2054, 'sara.shackleton@enron.com': 1989, 'kimberly.watson@enron.com': 1883, 'm..presto@enron.com': 1743, 'matthew.lenhart@enron.com': 1670, 'marie.heard@enron.com': 1670, 'barry.tycholiz@enron.com': 1644, 'john.lavorato@enron.com': 1578, 'pete.davis@enron.com': 1540, 'rick.buy@enron.com': 1539, 'tana.jones@enron.com': 1538, 'j.kaminski@enron.com': 1534, 'john.arnold@enron.com': 1521, 'sally.beck@enron.com': 1518, 'kam.keiser@enron.com': 1469}


---------------

### Final Task: Run the following cell so your outputs can be checked for accuracy - this is a requirement

## **Save your analytics results to a json object - then add, commit, and push your notebook and json to GitHub!**

In [181]:
import json
grading_dict = {'len_paths' : len(list_email_paths),
 'email_1' : str(email_1),
 'vanilla_split_match' : str(out_vanilla == out_split),
 'df_emails' : df_emails.head(10).to_string(),
 'dict_top_email_addresses' : str(dict_top_addresses_sent)
 }

# Use a context manager so the file is flushed and closed after writing
with open('soln.json', 'w') as fp:
    json.dump(grading_dict, fp)