# Lab 2 - Answering Real-World Questions Using Regular Expressions


## Due: Thursday, January 25, 2018,  11:59:00pm

### Submission instructions
After completing this homework, you will turn in two files via Canvas -> Assignments -> Homework 2:
* your notebook, named si330-hw2-YOUR_UNIQUE_NAME.ipynb
* the HTML file, named si330-hw2-YOUR_UNIQUE_NAME.html


### Name:  YOUR NAME GOES HERE
### Uniqname: YOUR UNIQNAME GOES HERE
### People you worked with: [if you didn't work with anyone else write "I worked by myself" here].


## Objectives
After completing this homework assignment, you should
* know how to use basic regular expressions
* have gained more experience with composite data structures and sorting

### Background

We will be using a larger version of the Enron email dataset that we used in this week's lab. 
This is a sample of 50,000 email messages from a large database of over 600,000 email 
messages generated by 158 employees of the Enron Corporation and acquired
by the Federal Energy Regulatory Commission during its investigation after the company's collapse. 
The Enron scandal, publicized in October 2001, eventually led to the bankruptcy of the 
Enron Corporation - one of the largest corporate bankruptcies in U.S. history. 
Using this dataset, you will be answering the following questions:

### Questions
1. Which two people exchanged the most email?
1. What fraction of the emails were replies?
1. What are the 20 most common words used in the "Subject" lines?



##### The rest of the notebook contains specific steps that you need to follow and complete.  Places where you need to do something are indicated in <font color="magenta">magenta</font>.

In [1]:
import csv
import re
from collections import defaultdict
# Fix python's limit on csv field length
import sys
maxInt = sys.maxsize
decrement = True

while decrement:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    decrement = False
    try:
        csv.field_size_limit(maxInt)
    except OverflowError:
        maxInt = int(maxInt/10)
        decrement = True


### Load the data

In [3]:
# Solution block
email_data_file_name = "email_sample_5000.csv"

email_data = []

with open(email_data_file_name, 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        email_data.append(row)
        
print(len(email_data))

5000


### About email
Email messages consist of two parts: the headers and the body.  They are separated by two newline characters (```\n\n```).
It's helpful to separate the two into a list of tuples: the list has one tuple per email; the first 
element of each tuple is the headers and the second is the body.  Why bother with this?  Because if we don't, we might
accidentally match header lines that were quoted in the body as part of a reply or forward. 
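
As a minimal sketch of that split, here is a made-up message (the `sample_message` string is hypothetical and not taken from the dataset). Passing `1` as the `maxsplit` argument of `str.split` is an assumption about what is wanted: it cuts at the first blank line only, so you always get exactly two parts even when the body contains blank lines of its own.

```python
# Hypothetical message for illustration only; not from the Enron data.
sample_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "Are you free at noon?\n"
    "\n"
    "From: this line is quoted text in the body, not a header\n"
)

# Split at the first blank line only, so later blank lines stay in the body.
headers, body = sample_message.split("\n\n", 1)

print(headers)
print("---")
print(body)
```

Searching only `headers` for `From:` now finds the real sender and ignores the quoted line in the body.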

In [4]:
# maxsplit=1 splits at the first blank line only, so each tuple is exactly (headers, body)
headers_bodies = [tuple(email['message'].split('\n\n', 1)) for email in email_data]

In [5]:
# Print the headers for the first email
print(headers_bodies[0][0])

Message-ID: <25829183.1075858373361.JavaMail.evans@thyme>
Date: Tue, 16 Jan 2001 22:06:00 -0800 (PST)
From: robin.rodrigue@enron.com
To: kori.loibl@enron.com
Subject: VAR
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Robin Rodrigue
X-To: Kori Loibl
X-cc: 
X-bcc: 
X-Folder: \Robin_Rodrique_Jun2001\Notes Folders\'sent mail
X-Origin: Rodrique-R
X-FileName: rrodri2.nsf


In [5]:
# Print the body for the first email
print(headers_bodies[0][1])

Thanks, Audrey!  Kim.


### <font color="magenta">Q1: Which two people exchanged the most email?</font>
This is mostly a repeat of the final question from this week's lab.
The main difference is that we're using a bigger data set.  Choose your regular expression for extracting 
the sender and especially the recipient ids carefully: each email message can be sent to multiple recipients.


#### Challenge (Above and Beyond)
Which headers, other than the "To:" header, can contain recipients? Note that each of those headers can contain multiple recipients, each separated by a comma.
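
One possible way to approach the challenge is sketched below. It assumes the extra recipient headers are `Cc:` and `Bcc:` and that long recipient lists may wrap onto continuation lines that start with whitespace. The `sample_headers` string and the `recipients_of` helper are hypothetical names made up for this illustration.

```python
import re

# Hypothetical headers for illustration only. The second To: line is a
# continuation line: it starts with whitespace instead of a header name.
sample_headers = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com,\n"
    "\tdave@example.com\n"
    "Subject: Budget\n"
    "Cc: erin@example.com\n"
    "Bcc: frank@example.com\n"
    "X-To: Bob, Carol, Dave\n"
)

def recipients_of(headers):
    """Collect addresses from the To:, Cc:, and Bcc: headers (hypothetical helper)."""
    found = []
    # With re.MULTILINE, ^ anchors at the start of each line, so "X-To:" is skipped.
    # (?:\n[ \t]+.*)* also consumes any continuation lines of the same header.
    pattern = r'^(?:To|Cc|Bcc): (.*(?:\n[ \t]+.*)*)'
    for value in re.findall(pattern, headers, re.MULTILINE):
        # Addresses are comma-separated; \s* absorbs the newline and tab of a wrapped line.
        for address in re.split(r',\s*', value):
            if address.strip():
                found.append(address.strip())
    return found

print(recipients_of(sample_headers))
```

Note that `re.DOTALL` is deliberately left out here, so `.` stops at the end of each line and the continuation lines have to be matched explicitly.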

In [16]:
#Solution block (not considering CC & BCC)
to_match = r'From: (.*?)\nTo: (.*?)\n'

conversation_count = defaultdict(int)
for email in headers_bodies:
    match = re.findall(to_match, email[0], re.DOTALL)
    if match:
        recipients = match[0][1].split(', ')
        for name in recipients:
            if name:
                exchange_pairs = tuple(sorted((match[0][0], name)))
                conversation_count[exchange_pairs] += 1

In [17]:
#Solution block (not considering CC & BCC)
sorted_conversation_count = sorted(conversation_count.items(), key = lambda x: x[1], reverse = True)

for i in sorted_conversation_count[:10]:
    print(i)

(('pete.davis@enron.com', 'pete.davis@enron.com'), 895)
(('vince.kaminski@enron.com', 'vkaminski@aol.com'), 386)
(('all.worldwide@enron.com', 'enron.announcements@enron.com'), 200)
(('kay.mann@enron.com', 'suzanne.adams@enron.com'), 190)
(('evelyn.metoyer@enron.com', 'kate.symes@enron.com'), 172)
(('shirley.crenshaw@enron.com', 'vince.kaminski@enron.com'), 161)
(('all.houston@enron.com', 'enron.announcements@enron.com'), 157)
(('kate.symes@enron.com', 'kerri.thompson@enron.com'), 150)
(('alan.comnes@enron.com', 'jeff.dasovich@enron.com'), 138)
(('ben.jacoby@enron.com', 'kay.mann@enron.com'), 137)


In [90]:
#Solution block (considering Cc; note that this pattern does not match the Bcc header)
to_match = r'From: (.*?)\nTo: (.*?)\n.*?\n(Cc: (.*?)\n)?'

conversation_count = defaultdict(int)
len_rec = []
for email in headers_bodies:
    match = re.findall(to_match, email[0], re.DOTALL)
    if match:
        recipients = []
        for i in [1, 3]:
            for rec in match[0][i].split(', '):
                if rec:
                    recipients.append(rec)
        len_rec.append(len(recipients))
        for name in recipients:
            exchange_pairs = tuple(sorted((match[0][0], name)))
            conversation_count[exchange_pairs] += 1
        

In [91]:
#Solution block (considering Cc)
sorted_conversation_count = sorted(conversation_count.items(), key = lambda x: x[1], reverse = True)

for i in sorted_conversation_count[:10]:
    print(i)

(('pete.davis@enron.com', 'pete.davis@enron.com'), 895)
(('craig.dean@enron.com', 'pete.davis@enron.com'), 814)
(('bert.meyers@enron.com', 'pete.davis@enron.com'), 594)
(('bill.williams.iii@enron.com', 'pete.davis@enron.com'), 537)
(('vince.kaminski@enron.com', 'vince.kaminski@enron.com'), 453)
(('vince.kaminski@enron.com', 'vkaminski@aol.com'), 431)
(('bill.williams@enron.com', 'pete.davis@enron.com'), 348)
(('albert.meyers@enron.com', 'pete.davis@enron.com'), 291)
(('shirley.crenshaw@enron.com', 'vince.kaminski@enron.com'), 261)
(('all.worldwide@enron.com', 'enron.announcements@enron.com'), 200)


### <font color="magenta">Q2: What fraction of the email messages were replies?</font>
In email messages, replies are typically indicated with a "Subject" header that starts
with "Re:".  So to answer this question you'll need to find the number of email messages
whose "Subject" header contains "Re: " and then represent that number as a fraction of
the total number of email messages (i.e. divide the number of replies by the total number of messages).

##### Your output should show the fraction as a percentage with no decimal values and should look like:
```65% of the email messages were replies.``` (of course with the correct value)

__Hint__: This is similar to extracting the sender email address, except all we want to do is
determine whether the Subject header contains "Re: ".  We don't care about the rest of the header.

__Hint__: Try using the .format() function to make your life easier when you're generating your output.
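
As a small illustration of the `.format()` hint, the sketch below counts replies in a made-up list of subject headers (`sample_subjects` is hypothetical, not from the dataset) and prints the fraction as a whole-number percentage. The `{:.0%}` format spec multiplies the value by 100, rounds to zero decimal places, and appends a percent sign.

```python
import re

# Hypothetical subject headers for illustration only.
sample_subjects = [
    "Subject: Re: VAR",
    "Subject: Meeting tomorrow",
    "Subject: RE: Meeting tomorrow",
    "Subject: Schedule",
]

# Count the headers that begin with "Subject: re:", ignoring case.
n_replies = sum(
    1 for s in sample_subjects if re.match(r'Subject: re:', s, re.IGNORECASE)
)
fraction = n_replies / len(sample_subjects)

message = '{:.0%} of the email messages were replies.'.format(fraction)
print(message)  # 50% of the email messages were replies.
```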

#### Challenge (Above and Beyond)
Can you do this in 4 lines of code? 

In [34]:
#Solution block
nreplies = 0
for hb in headers_bodies:
    if re.search('Subject: re:', hb[0], re.IGNORECASE): nreplies += 1
print('{:.0%} of the email messages were replies.'.format(nreplies/len(headers_bodies)))

30% of the email messages were replies.


### <font color="magenta">Q3: What are the 20 most common words used in the "Subject" lines?</font>

This is a bit more complex than the previous question.  In this case, we're actually interested in the
contents of the "Subject:" header.  In addition, we want to exclude strings that indicate the message is
a reply ("Re:") or a forward ("Fwd:").  Finally, we want to exclude strings that represent commonly used
"stopwords": words like "a", "an", "and", etc.

To make it a bit easier for you, we've generated a list of stopwords (we'll learn more about this next week):

In [96]:
stopWords = {'then', 'was', 'over', 'such', 'him', 'shan', 'at', 'haven', 'as', 'off', 'all', 'of', 'are', 'in', 'm', 'out', 'into', 'too', 'didn', 'wasn', "weren't", 'through', "mightn't", 'below', 'on', 'will', 'there', 'needn', 'wouldn', 'why', 'have', 'yourself', "needn't", 'having', 'am', "it's", 'by', 'itself', 'they', 'he', 'being', 'hadn', 'mustn', 'don', "she's", 'where', 'yours', 'its', 'nor', 'not', 'that', 'the', 'who', 'our', 'these', 'up', 'their', 'himself', 'a', 'about', "don't", 'has', 're', 'to', 'more', 'doesn', 'both', 'which', 'any', 'ain', 'ourselves', 'had', 'this', 'while', 'herself', 'against', 'very', 'weren', 'myself', 'been', "should've", 'what', 'can', 'or', 'your', "isn't", "wasn't", 'does', 'how', "you'll", 'she', "hasn't", "shouldn't", 'my', 'once', 's', "hadn't", 'those', 'is', 'do', 'ours', 'but', "wouldn't", 'his', 'now', 'down', 'each', 'i', 'here', 'from', 'me', 'other', 'be', 'hers', "you're", 'until', 'further', 'y', 'own', 'again', 'just', "haven't", "shan't", 'under', 'when', "doesn't", 'and', "won't", 'no', 'above', 'them', 've', 'so', 'if', 'we', 'were', 'same', 'with', 'mightn', 'ma', 'for', 'hasn', 'couldn', 'after', 'aren', 'yourselves', "you'd", 'should', "aren't", 'o', "didn't", 'themselves', 'most', 'whom', 'shouldn', 'you', 'between', "couldn't", "you've", 'an', 'because', 'before', "mustn't", 'won', 'only', 'doing', 'some', 'her', 'did', 't', 'during', 'it', 'll', 'isn', "that'll", 'd', 'theirs', 'few', 'than', '-', '&'}

The general approach that you should use is to iterate through the headers_bodies list and extract the
contents of the Subject: header (excluding Re: and Fwd:).  Then, assuming you have the contents in a string,
split the string (using the ```.split()``` function). You should convert each of the resulting words to lowercase
and count the occurrences of each one that isn't a stopword.  We've done that a lot using ```defaultdict(int)```.

Here's some code to get you started.  It assumes you're in the middle of looping through the headers and bodies
list.  
```
match = re.search(...) # we're not going to tell you how to do this :)
subject = match.group(1) # assumes your pattern captures the subject text in group 1
lower_subject = subject.lower()
words = lower_subject.split()
for word in words:
    if word not in stopWords:
        pass # do something with this word
```
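
To see the counting steps end to end on toy data, here is a minimal sketch. The subjects and the tiny stopword set are made up for illustration. It also takes one particular reading of "excluding Re: and Fwd:": it strips those markers and keeps the rest of the subject, which is an assumption and only one of several reasonable approaches.

```python
import re
from collections import defaultdict

# Hypothetical subjects and a tiny stand-in stopword set, for illustration only.
sample_subjects = ["Re: Gas report", "Fwd: The gas schedule", "New report for gas"]
tiny_stopwords = {"the", "for"}

word_counts = defaultdict(int)
for subject in sample_subjects:
    # Drop any leading run of reply/forward markers such as "Re:", "Fw:", or "Fwd:".
    cleaned = re.sub(r'^(?:(?:re|fwd?):\s*)+', '', subject, flags=re.IGNORECASE)
    for word in cleaned.lower().split():
        if word not in tiny_stopwords:
            word_counts[word] += 1

# Sort (word, count) pairs by count, largest first.
top = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
print(top[:2])  # [('gas', 3), ('report', 2)]
```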

In [41]:
# insert your code here

#### Print out the most commonly used words

In [97]:
# Solution code
# This solution ignores the subject lines of emails that were replies or forwards.
# We have also given points to answers where "re" or "fw" were treated as stopwords
to_match = r'Subject: (?!(Re:)|(Fw:)|(Fwd:))(.+?)\n'

words_dict = defaultdict(int)
for email in headers_bodies:
    match = re.findall(to_match, email[0], re.IGNORECASE) 
    if match:
        subject = match[0][3].lower().split()
        for word in subject:
            if word not in stopWords:
                words_dict[word] += 1

In [98]:
# Solution code
sorted_words_dict = sorted(words_dict.items(), key=lambda x: x[1], reverse = True)

for k, v in sorted_words_dict[:20]:
    print(k, v)

enron 1279
meeting 949
hourahead 883
new 773
report 760
start 728
date: 716
hour: 710
2001 677
gas 642
update 630
energy 629
<codesite> 595
power 594
2000 575
agreement 553
request 516
conference 408
schedule 395
call 366
