# Capstone Project 1: Data Wrangling

**Our Problem:** 

Can emails be used to identify the author's gender?

## Overview
For this notebook, our **goal** will be to Acquire, Wrangle, and Clean data from the Enron email dataset.

## I. Acquiring the Data
The data we'll be working with comes in two forms:
- The Enron email dataset
- Assumed gender by name

The Enron email dataset can be downloaded from the Carnegie Mellon University School of Computer Science (https://www.cs.cmu.edu/~./enron/); the link was live as of March 28, 2020. It comes conveniently compressed under the filename `enron_mail_20150507.tar.gz`.

To collect the assumed genders, by name, we'll need to isolate the sender names and decide on a method of identifying gender. We'll return to this once the dataset is cleaned.

### Acquiring the Data: Enron Emails

Once unzipped, the top-level folder is `maildir`; we'll call this the *mail directory* (I know - creative). The hierarchy in this directory is as follows:
- Mail Directory ('maildir')
 - Employee Email Folders
  - disorganized mess (folders + emails)
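
Unpacking the archive is not shown in this notebook. A minimal sketch with the standard `tarfile` module might look like the following; the helper name `extract_archive` and the archive location under `./data/` are assumptions, not something the notebook specifies.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the top-level names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # 'r:gz' opens a gzip-compressed tar archive for reading
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(dest)

    # report what now sits directly inside the destination folder
    return sorted(p.name for p in dest.iterdir())


# assumed location of the download; adjust to wherever the archive was saved
# extract_archive('./data/enron_mail_20150507.tar.gz', './data/')
```

Only unpack archives from a source you trust, since extraction writes whatever paths the archive contains.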

Let's use the `os` library to explore the directory with its function `os.scandir()`. We'll evaluate each object returned by `os.scandir()`; if it is another folder, we'll `.append()` its path to our list while the loop is still running, so it will also get explored.

Once all the folders have been collected, the second part of the function will iterate through the complete `folder_list` and compile a `file_list` of email file names.

In [None]:
import os

# setup function to return filenames from directory
def file_grabber(some_directory):
    """Returns all files from directory
    and child directories by creating, appending
    a directory list until all directories are located.
    
    Then, cycles through each directory and records any
    filenames to a list. This list is returned on completion."""
    
    # initialize folder list
    folder_list = []

    # appends starting folder to folder list
    folder_list.append(some_directory)
    
    # iterate through folder list
    for folder in folder_list:
        
        # open context manager with .scandir() on folder
        with os.scandir(folder) as open_folder:

            # for each item in the directory folder
            for thing in open_folder:

                # if the thing is another folder
                if thing.is_dir():
                    thing_path = (folder + thing.name + '/')
                    folder_list.append(thing_path)
                else:
                    continue
      
    # print out number of folders
    print('{} folders found.'.format(len(folder_list)))
    
    # initialize file list
    file_list = []
    
    # iterate through folder list to collect filenames
    for folder in folder_list:
        
        # open context manager with .scandir() on folder
        with os.scandir(folder) as open_folder:

            # for each item in the directory folder
            for thing in open_folder:

                # if the thing is a file
                if thing.is_file():
                    thing_path = (folder + thing.name)
                    file_list.append(thing_path)
                else:
                    continue
                
    # print out report                    
    print('{} files found.'.format(len(file_list)))
    
    # return files as list
    return file_list

***Function complete.*** Now, let's put it into action!

In [2]:
# initialize list to collect filenames
email_files = []

# create base string
base_name = './data/maildir/'
    
# call function to collect file names
email_files = file_grabber(base_name)

3500 folders found.
517401 files found.


***Nice!*** 517401 files found! Let's sort the list and look at some of the filepaths returned.

In [3]:
# sort the list
email_files.sort()

# take a peek
email_files[:5]

['./data/maildir/allen-p/_sent_mail/1',
 './data/maildir/allen-p/_sent_mail/10',
 './data/maildir/allen-p/_sent_mail/100',
 './data/maildir/allen-p/_sent_mail/1000',
 './data/maildir/allen-p/_sent_mail/1001']
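
For comparison, the same recursive listing can be sketched with `os.walk()`, which handles the folder-by-folder traversal itself. This is an alternative, not the function used in this notebook, and `walk_files` is a hypothetical name.

```python
import os


def walk_files(root):
    """Return a sorted list of paths for every file under root."""
    found = []

    # os.walk yields (directory path, subfolder names, file names) per folder
    for current_dir, _subdirs, file_names in os.walk(root):
        for name in file_names:
            found.append(os.path.join(current_dir, name))

    return sorted(found)
```

Note that sorting paths as strings is lexicographic, which is why `1`, `10`, and `100` appear next to each other in the listing above.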

## II. Wrangling the Data

> Data Wrangling:
>
> "The process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as **analytics**." - Wikipedia, Data wrangling

Let's begin by exploring the *current state of the data*. We'll make notes, and decide what ***wrangling*** we'll need to do. 

The dataset holds over 0.5M emails. It comes as a *bunch of folders* containing a **bunch of emails**, and the unzipped form we'll be working with weighs in around 1.4GB. Let's review an email and isolate some parts to wrangle.

In [4]:
# email reader
def read_email(email_path):
    """returns the full text of an email file (headers and body)"""
    
    # context manager opens email file, reads its text into a variable
    with open(email_path) as email_file:
        email_body = email_file.read()
    
    # returns the full email text
    return email_body

In [5]:
# print the full text of one email
print(read_email(email_files[10]))

Message-ID: <33076797.1075855687515.JavaMail.evans@thyme>
Date: Mon, 16 Oct 2000 06:42:00 -0700 (PDT)
From: phillip.allen@enron.com
To: buck.buckner@honeywell.com
Subject: Re: FW: fixed forward or other Collar floor gas price terms
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: "Buckner, Buck" <buck.buckner@honeywell.com> @ ENRON
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

Mr. Buckner,

 For delivered gas behind San Diego, Enron Energy Services is the appropriate 
Enron entity.  I have forwarded your request to Zarin Imam at EES.  Her phone 
number is 713-853-7107.  

Phillip Allen


### ***To evaluate our problem,*** we need gender and email text.

From the email above, we'll collect the `From:` field to capture names (and subsequently, make some assumptions on gender). We'll also need to isolate the email body from the rest of the header information contained in the file. Finally, to keep the data subsets tied to their original forms, we'll preserve each email's file path as an index.

Let's tackle this goal in two steps:
- Collect the variables of interest in a dictionary using the `re` library
- Convert the dictionary to a DataFrame using `pd.DataFrame.from_dict()`
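
To see how this kind of pattern behaves before running it over the whole dataset, here is a small sketch on a shortened copy of the email above. The names `sample`, `from_pat`, and `body_pat` are illustrative only.

```python
import re

# shortened copy of the message printed above
sample = (
    "Message-ID: <33076797.1075855687515.JavaMail.evans@thyme>\n"
    "From: phillip.allen@enron.com\n"
    "To: buck.buckner@honeywell.com\n"
    "X-FileName: pallen.nsf\n"
    "\n"
    "Mr. Buckner,\n"
)

# the lookbehind anchors on the field name, then the match runs to end of line
from_pat = re.compile('(?<=\nFrom: )[^\n]*')

# the last header line plus the blank line after it marks where the body starts
body_pat = re.compile('\nX-FileName: [^\n]*\n\n')

sender = from_pat.search(sample)[0]
header, body = body_pat.split(sample, maxsplit=1)

print(sender)      # phillip.allen@enron.com
print(repr(body))  # 'Mr. Buckner,\n'
```

Because the lookbehind is not part of the match, `sender` holds only the address, without the `From: ` prefix.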



In [6]:
# libraries
import re

# custom function
def save_to_dict(email_path):
    """parses one email file and stores its header fields and body
       in the global wrangle_dict, keyed by file path; returns None"""

    # open email file, get email text
    with open(email_path) as open_email:
        
        # get file text
        email_text = open_email.read()
        
        # get 'Message-ID'
        m_id = ''
        catch = m_id_pat.search(email_text)
        if catch:
            m_id = catch[0]         
    
        # get 'Date'
        m_date = ''
        catch = m_date_pat.search(email_text)
        if catch:
            m_date = catch[0]     
        
        # get 'From'
        m_from = ''
        catch = m_from_pat.search(email_text)
        if catch:
            m_from = catch[0] 
        
        # get 'To'
        m_to = ''
        catch = m_to_pat.search(email_text)
        if catch:
            m_to = catch[0] 

        # get 'Cc'
        m_cc = ''
        catch = m_cc_pat.search(email_text)
        if catch:
            m_cc = catch[0] 
        
        # get 'Bcc'
        m_bcc = ''
        catch = m_bcc_pat.search(email_text)
        if catch:
            m_bcc = catch[0] 

        # get 'Subject'
        m_subj = ''
        catch = m_subj_pat.search(email_text)
        if catch:
            m_subj = catch[0] 

        # get 'Mime-Version'
        mime_vers = ''
        catch = mime_vers_pat.search(email_text)
        if catch:
            mime_vers = catch[0] 

        # get 'Content-Type'
        cont_type = ''
        catch = cont_type_pat.search(email_text)
        if catch:
            cont_type = catch[0] 

        # get 'Content-Transfer-Encoding'
        encode = ''
        catch = encode_pat.search(email_text)
        if catch:
            encode = catch[0] 

        # get 'X-From'
        x_from = ''
        catch = x_from_pat.search(email_text)
        if catch:
            x_from = catch[0] 

        # get 'X-To'
        x_to = ''
        catch = x_to_pat.search(email_text)
        if catch:
            x_to = catch[0] 

        # get 'X-cc'
        x_cc = ''
        catch = x_cc_pat.search(email_text)
        if catch:
            x_cc = catch[0] 

        # get 'X-bcc'
        x_bcc = ''
        catch = x_bcc_pat.search(email_text)
        if catch:
            x_bcc = catch[0] 

        # get 'X-Folder'
        x_fold = ''
        catch = x_fold_pat.search(email_text)
        if catch:
            x_fold = catch[0] 

        # get 'X-Origin'
        x_orig = ''
        catch = x_orig_pat.search(email_text)
        if catch:
            x_orig = catch[0] 

        # get 'X-Filename'
        x_fname = ''
        catch = x_fname_pat.search(email_text)
        if catch:
            x_fname = catch[0]
    
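        # get body (everything after the blank line that follows 'X-FileName')
        m_body = ''
        catch = m_body_pat.split(email_text, 1)
        # split() returns a one-item list when the pattern is not found
        if len(catch) > 1:
            m_body = catch[1]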
        
        # create dictionary entry
        wrangle_dict[email_path] = [m_id, m_date, m_from, m_to, 
                                    m_cc, m_bcc, m_subj, mime_vers, 
                                    cont_type, encode, x_from, x_to, 
                                    x_cc, x_bcc, x_fold, x_orig, 
                                    x_fname, m_body]
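
As a point of comparison, Python's standard `email` package can parse the same header layout without hand-written patterns. This is only a sketch of an alternative, not the approach this notebook uses; `parse_email` is a hypothetical helper and the sample text is a shortened copy of the message shown earlier.

```python
from email import message_from_string


def parse_email(raw_text):
    """Return a few header fields and the body from raw email text."""
    message = message_from_string(raw_text)
    return {
        'm_from': message['From'] or '',   # missing headers come back as None
        'm_cc': message['Cc'] or '',
        'x_from': message['X-From'] or '',
        'm_body': message.get_payload(),   # plain string for non-multipart mail
    }


raw = (
    "Message-ID: <33076797.1075855687515.JavaMail.evans@thyme>\n"
    "From: phillip.allen@enron.com\n"
    "X-From: Phillip K Allen\n"
    "X-FileName: pallen.nsf\n"
    "\n"
    "Mr. Buckner,\n"
)

parsed = parse_email(raw)
print(parsed['m_from'])
```

The parser treats the first blank line as the end of the headers, which matches the layout of the files in this dataset.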

In [7]:
%%time
# import libraries
import re
import concurrent.futures

# set regex patterns 'Message-ID'
m_id_pat = re.compile('(?<=Message-ID: )[^\n]*')

# set regex patterns 'Date'
m_date_pat = re.compile('(?<=\nDate: )[^\n]*')
    
# set regex patterns 'From'
m_from_pat = re.compile('(?<=\nFrom: )[^\n]*')

# set regex patterns 'To'
m_to_pat = re.compile('(?<=\nTo: )[^\n]*')

# set regex patterns 'Cc'
m_cc_pat = re.compile('(?<=\nCc: )[^\n]*')

# set regex patterns 'Bcc'
m_bcc_pat = re.compile('(?<=\nBcc: )[^\n]*')

# set regex patterns 'Subject'
m_subj_pat = re.compile('(?<=\nSubject: )[^\n]*')

# set regex patterns 'Mime-Version'
mime_vers_pat = re.compile('(?<=\nMime-Version: )[^\n]*')

# set regex patterns 'Content-Type'
cont_type_pat = re.compile('(?<=\nContent-Type: )[^\n]*')

# set regex patterns 'Content-Transfer-Encoding'
encode_pat = re.compile('(?<=\nContent-Transfer-Encoding: )[^\n]*')

# set regex patterns 'X-From'
x_from_pat = re.compile('(?<=\nX-From: )[^\n]*')

# set regex patterns 'X-To'
x_to_pat = re.compile('(?<=\nX-To: )[^\n]*')

# set regex patterns 'X-cc'
x_cc_pat = re.compile('(?<=\nX-cc: )[^\n]*')

# set regex patterns 'X-bcc'
x_bcc_pat = re.compile('(?<=\nX-bcc: )[^\n]*')

# set regex patterns 'X-Folder'
x_fold_pat = re.compile('(?<=\nX-Folder: )[^\n]*')

# set regex patterns 'X-Origin'
x_orig_pat = re.compile('(?<=\nX-Origin: )[^\n]*')

# set regex patterns 'X-FileName' (value only, without the trailing blank line)
x_fname_pat = re.compile('(?<=\nX-FileName: )[^\n]*')

# set regex patterns 'Body'
m_body_pat = re.compile('\nX-FileName: [^\n]*\n\n')

# initialize dictionary
wrangle_dict = {}

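# concurrent futures executor; list() consumes the results so that an
# exception raised in a worker thread surfaces here instead of being dropped
with concurrent.futures.ThreadPoolExecutor() as executor:  
    results = list(executor.map(save_to_dict, email_files))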


Wall time: 5min 33s


In [8]:
len(wrangle_dict)

517401
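
One thing to keep in mind with `executor.map()`: it returns a lazy iterator, and an exception raised inside a worker only surfaces when that result is read. A variation that returns values instead of writing to a shared dictionary could look like the sketch below; `word_count` is a made-up stand-in for the real parsing function.

```python
import concurrent.futures


def word_count(text):
    """Hypothetical stand-in for per-email parsing work."""
    return len(text.split())


texts = ['one two', 'three', 'four five six']

with concurrent.futures.ThreadPoolExecutor() as executor:
    # map() yields results in input order, so zip() pairs each input
    # with its own result; reading the results re-raises worker errors
    counts = dict(zip(texts, executor.map(word_count, texts)))

print(counts)
```

Returning values keeps the worker function free of global state, which makes it easier to test on a handful of files first.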

In [429]:
%%time
# import pandas
import pandas as pd

# create dataframe from dictionary
wrangle_frame = pd.DataFrame.from_dict(wrangle_dict, orient='index', columns = ['m_id', 'm_date', 'm_from', 'm_to', 
                                                                                'm_cc', 'm_bcc', 'm_subj', 'mime_vers', 
                                                                                'cont_type', 'encode', 'x_from', 'x_to', 
                                                                                'x_cc', 'x_bcc', 'x_fold', 'x_orig', 
                                                                                'x_fname', 'm_body'])

# sort by file path, then move the path index out into a column
wrangle_frame = wrangle_frame.sort_index().reset_index()

# rename index column 'f_dir'
wrangle_frame = wrangle_frame.rename(columns = {'index' : 'f_dir'})

# export master to csv
wrangle_frame.to_csv('./data/00_original_wrangle.csv', index=False, index_label=False)

# check DataFrame
wrangle_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517401 entries, 0 to 517400
Data columns (total 19 columns):
f_dir        517401 non-null object
m_id         517401 non-null object
m_date       517401 non-null object
m_from       517401 non-null object
m_to         517401 non-null object
m_cc         517401 non-null object
m_bcc        517401 non-null object
m_subj       517401 non-null object
mime_vers    517401 non-null object
cont_type    517401 non-null object
encode       517401 non-null object
x_from       517401 non-null object
x_to         517401 non-null object
x_cc         517401 non-null object
x_bcc        517401 non-null object
x_fold       517401 non-null object
x_orig       517401 non-null object
x_fname      517401 non-null object
m_body       517401 non-null object
dtypes: object(19)
memory usage: 75.0+ MB
Wall time: 29.9 s
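
To make the index handling concrete, here is the same `from_dict`, `sort_index`, `reset_index`, `rename` chain on a two-row toy dictionary. The paths and addresses are invented for illustration.

```python
import pandas as pd

# toy stand-in for wrangle_dict: file path -> list of field values
toy = {
    './data/maildir/b/2': ['b@enron.com', 'second body'],
    './data/maildir/a/1': ['a@enron.com', 'first body'],
}

# orient='index' turns each dictionary key into a row label
frame = pd.DataFrame.from_dict(toy, orient='index',
                               columns=['m_from', 'm_body'])

# sort by path, move the path into a column, and give it a name
frame = frame.sort_index().reset_index().rename(columns={'index': 'f_dir'})

print(frame)
```

After `reset_index()`, the unnamed index becomes a column called `index`, which is why the rename step targets that name.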
