# Using Python to Access Your E-mail

Python has several modules in its standard library for working with e-mail and other web tools. Using Python you can compose and send e-mails, retrieve e-mails from mail servers, and parse the content of e-mail files. In this notebook we will do the latter two.

For the example here I am going to use my University of Utah e-mail. Similar approaches can be used for other e-mail providers. For example, [here](https://developers.google.com/gmail/api/quickstart/python) are instructions from Google about how to interact with Gmail via Python.

In [6]:
import os
DATADIR = os.getcwd()  # directory where the output files are written
import csv
import imaplib
import getpass
import email
from collections import defaultdict
import gzip
import pickle


# Working with E-Mail
* Python has several modules for working with e-mail, including sending e-mails (not going to talk about this), working with an inbox, and parsing e-mail messages
* [imaplib](https://docs.python.org/3/library/imaplib.html)
* Below is a code snippet adapted from the Python documentation
* Some notes:
    * **``getpass.getpass()``** prompts for a password without echoing it back to the screen
    * also **``getpass.getuser()``**, which returns the login name of the current user

### Here is a script to connect to and pull e-mails from UMail

This was very slow for me, probably largely because I don't delete enough e-mails.

* [``imaplib.IMAP4_SSL``](https://docs.python.org/3/library/imaplib.html): "This is a subclass derived from IMAP4 that connects over an SSL encrypted socket."
* [``getpass.getpass``](https://docs.python.org/3.5/library/getpass.html): ``getpass`` allows us to get passwords (or other text) that we don't want echoed back to the screen. As a best practice, pass the result of ``getpass`` directly to the function that needs the password so that you don't have a variable floating around with sensitive information.

In [8]:

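# Connect over SSL; 993 is the standard port for IMAP over SSL
M = imaplib.IMAP4_SSL("imap.umail.utah.edu", port=993)
# Pass the getpass results straight to login() so no variable holds the password
M.login('%s@umail.utah.edu' % getpass.getpass("Enter your University of Utah ID").strip(),
        getpass.getpass("Enter your University of Utah password").strip())
M.select()  # with no arguments this selects the INBOX
typ, data = M.search(None, "ALL")
msgs = {}
count = 0
for num in data[0].split():
    count += 1
    # My inbox had around 12000 messages in it.
    # This was a way to keep me up to date on whether
    # my program was really progressing
    if count % 500 == 0:
        print(num)
    # Use a separate name so the search results in `data` are not overwritten
    typ, msg_data = M.fetch(num, '(RFC822)')
    msgs[num] = msg_data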


Enter your University of Utah ID········
Enter your University of Utah password········
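
The fetch loop above can also be wrapped in a small function, which makes it easier to reuse and to try out without a live server. This is only a sketch: the name `fetch_all_messages` and its parameters are hypothetical, and it assumes the connection object behaves like `imaplib.IMAP4_SSL` (it provides `select`, `search`, and `fetch`). Passing `readonly=True` to `select` asks the server not to change message flags, so unread messages should stay unread.

```python
def fetch_all_messages(conn, mailbox="INBOX", progress_every=500):
    """Return {message number: fetch data} for every message in `mailbox`.

    `conn` is assumed to be a logged-in, imaplib.IMAP4_SSL-like object.
    """
    # readonly=True opens the mailbox without modifying message flags
    typ, _ = conn.select(mailbox, readonly=True)
    if typ != "OK":
        raise RuntimeError("could not select mailbox %r" % mailbox)

    typ, found = conn.search(None, "ALL")
    numbers = found[0].split()

    messages = {}
    for count, num in enumerate(numbers, start=1):
        typ, fetched = conn.fetch(num, "(RFC822)")
        if typ == "OK":
            messages[num] = fetched
        # Periodic progress report for large mailboxes
        if count % progress_every == 0:
            print("processed %d of %d messages" % (count, len(numbers)))
    return messages
```

With a real server this might be called as `msgs = fetch_all_messages(M)` right after `M.login(...)`.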


### Write everything out to a pickle file

* I don't want to query my e-mail very often since it is so slow, so let's save the data for later use.

In [9]:
with gzip.open(os.path.join(DATADIR,
               "myEmail11192017.pickle.gzip"),"wb") as fo:
    pickle.dump(msgs,fo)

#### If we want to start over we can just read in the pickle file and skip the IMAP step

In [10]:
with gzip.open(os.path.join(DATADIR,
               "myEmail11192017.pickle.gzip"),"rb") as fo:
    msgs = pickle.load(fo)
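
To convince yourself that the save/load round trip preserves the data, here is a small self-contained check. The dictionary `toy_msgs` and the file name are made up for illustration; they only mimic the shape of `msgs`.

```python
import gzip
import os
import pickle
import tempfile

# Toy stand-in for the `msgs` dictionary (made-up content)
toy_msgs = {b"1": [(b"1 (RFC822 {5}", b"hello"), b")"]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_email.pickle.gzip")

    # Write the compressed pickle ...
    with gzip.open(path, "wb") as fo:
        pickle.dump(toy_msgs, fo)

    # ... and read it straight back
    with gzip.open(path, "rb") as fo:
        restored = pickle.load(fo)

print(restored == toy_msgs)  # True
```

One caution: `pickle.load` can execute arbitrary code, so only unpickle files that you created yourself.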

# Parsing e-mail messages
* [email:](https://docs.python.org/3/library/email.html#module-email)

>The email package is a library for managing email messages, including MIME and other RFC 2822-based message documents. It is specifically not designed to do any sending of email messages to SMTP (RFC 2821), NNTP, or other servers; those are functions of modules such as smtplib and nntplib. The email package attempts to be as RFC-compliant as possible, supporting in addition to RFC 2822, such MIME-related RFCs as RFC 2045, RFC 2046, RFC 2047, and RFC 2231. (Python Documentation)

## Read e-mails and save `From`/`To` and `Date` information
### Always some unicode confusion



### What does a message look like?

In [11]:
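# b'9381' is a message number from my mailbox; fall back to the first
# available key if your mailbox does not contain it
key = b'9381' if b'9381' in msgs else next(iter(msgs))
m = msgs[key]
type(m), len(m)

In [12]:
print(m)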

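#### The message is a two-element list
* The first element is a tuple
    * The first element of which is some index information (the message number and the size of the message).
    * The second element is a big, nasty bytes string holding the raw message.
* The second element is a short bytes string that closes the server response (typically just `b')'`, sometimes with status flags)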
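
To make the nesting concrete, here is a small sketch that unpacks a made-up entry with the shape `M.fetch` typically returns. The variable names (`toy`, `envelope`, `raw_message`, `trailer`) are hypothetical.

```python
# Made-up data with the same nesting as one entry of `msgs`
toy = [
    (b"1 (RFC822 {42}", b"From: a@example.edu\r\nSubject: hi\r\n\r\nbody\r\n"),
    b")",
]

# Unpack the outer pair and the inner tuple in one step
(envelope, raw_message), trailer = toy

print(len(toy))                    # 2
print(envelope)                    # index information
print(raw_message.decode()[:19])   # start of the raw message text
print(trailer)                     # closing part of the server response
```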

In [None]:
print("The length of the message is %s"%len(m))
print(m[0][0],m[1])


In [None]:
print(m[0][1].decode()[0:350])


### Now we need to parse the messages

* Create an e-mail parser
* Take a look at what a parsed message looks like

* [parsestr](https://docs.python.org/3/library/email.parser.html#email.parser.Parser.parsestr)

* Lots and lots of header information
* Text of e-mail is buried in a bunch of HTML that would have to be parsed.

In [15]:
import email.parser  # make sure the parser submodule is loaded
p = email.parser.Parser()
e = p.parsestr(m[0][1].decode())
print(e.keys())


['Received', 'From', 'To', 'Subject', 'Thread-Topic', 'Thread-Index', 'Date', 'Message-ID', 'References', 'In-Reply-To', 'Accept-Language', 'Content-Language', 'X-MS-Exchange-Organization-AuthAs', 'X-MS-Exchange-Organization-AuthMechanism', 'X-MS-Exchange-Organization-AuthSource', 'X-MS-Has-Attach', 'X-MS-Exchange-Organization-SCL', 'X-MS-TNEF-Correlator', 'Content-Type', 'Content-ID', 'Content-Transfer-Encoding', 'MIME-Version']
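
An alternative that sidesteps some of the unicode confusion is `email.message_from_bytes`, which parses the raw bytes directly so you don't have to guess a codec for the whole message up front. The sketch below uses a made-up message (the names and addresses are not from my inbox); in the notebook the raw bytes would be `m[0][1]`.

```python
import email

# Made-up raw message; note the non-ASCII byte \xe9 in the body
raw = (b"From: Ann Example <ann@example.edu>\r\n"
       b"To: Bob Example <bob@example.edu>\r\n"
       b"Subject: registration\r\n"
       b"Date: Wed, 22 Nov 2017 12:39:49 -0700\r\n"
       b"Content-Type: text/plain; charset=\"windows-1252\"\r\n"
       b"\r\n"
       b"Caf\xe9 at noon?\r\n")

msg = email.message_from_bytes(raw)
print(msg.keys())
print(msg["From"])

# get_payload(decode=True) returns the body as bytes; the charset
# declared in Content-Type tells us how to turn it into text
charset = msg.get_content_charset()
body = msg.get_payload(decode=True).decode(charset)
print(charset)
print(body)
```

This only covers a single-part message. Multipart messages would need a loop over `msg.walk()`.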


In [16]:
for k,v in e.items():
    print(k)
    print(v)
    print()

Received
from X-MB9.xds.umail.utah.edu ([169.254.13.25]) by
 X-HUB4.xds.umail.utah.edu ([155.97.144.94]) with mapi id 14.03.0361.001; Wed,
 22 Nov 2017 12:39:50 -0700

From
Barbara Saffel <barbara.saffel@utah.edu>

To
MICHAEL THOMAS WATKINS <michael.watkins@utah.edu>

Subject
Re: registration

Thread-Topic
registration

Thread-Index
AQHTY6moIBkocHBTyEWsXyEvxmCepKMgkmwXgAA5kYA=

Date
Wed, 22 Nov 2017 12:39:49 -0700

Message-ID
<FFA4D5E0-D9E9-46FB-8988-E8A2C4FA7B60@utah.edu>

References
<DEA4814A-67A9-4A6C-B44D-DE88228FBBB3@utah.edu>
 <F111E3A240FA8045A76F09AF9B3302F625D5A14F@X-MB2.xds.umail.utah.edu>

In-Reply-To

 <F111E3A240FA8045A76F09AF9B3302F625D5A14F@X-MB2.xds.umail.utah.edu>

Accept-Language
en-US

Content-Language
en-US

X-MS-Exchange-Organization-AuthAs
Internal

X-MS-Exchange-Organization-AuthMechanism
04

X-MS-Exchange-Organization-AuthSource
X-HUB4.xds.umail.utah.edu

X-MS-Has-Attach


X-MS-Exchange-Organization-SCL
-1

X-MS-TNEF-Correlator


Content-Type
text/html; char

In [17]:
e["date"]

'Wed, 22 Nov 2017 12:39:49 -0700'
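
The date comes back as a plain string. If you want to sort or group messages by time, the standard library's `email.utils.parsedate_to_datetime` can turn it into a timezone-aware `datetime`. A minimal sketch using the value shown above:

```python
from email.utils import parsedate_to_datetime

stamp = parsedate_to_datetime("Wed, 22 Nov 2017 12:39:49 -0700")

print(stamp.isoformat())      # 2017-11-22T12:39:49-07:00
print(stamp.year, stamp.month, stamp.day)
print(stamp.utcoffset())      # offset from UTC as a timedelta
```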

In [18]:
import re
from itertools import product
rclean = re.compile(r"""\s+""")
remail = re.compile(r"""<(?P<email>\S+@\S+)>""")
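
Here is a quick look at what the `remail` pattern picks up. The first header value is the `From` line of the message shown earlier; the other two are made up to show a multi-recipient header and a limitation of the pattern.

```python
import re

remail = re.compile(r"""<(?P<email>\S+@\S+)>""")

# A single address inside angle brackets
print(remail.findall("Barbara Saffel <barbara.saffel@utah.edu>"))

# Two recipients in one header (made-up addresses)
two = "Ann Example <ann@example.edu>, Bob Example <bob@example.edu>"
print(remail.findall(two))

# A bare address without angle brackets is NOT matched
print(remail.findall("ann@example.edu"))
```

Because the pattern requires angle brackets, headers that contain only a bare address are silently skipped.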

#### How do we want to simplify our data?

* There is no consistency in how names are provided (e.g. "Yiling Bi" or "Bi, Yiling")
* `From` names a single sender
* `To` can list one or many recipients
    * Sometimes I find blank entries for both "To" and "From"
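
One way around the inconsistent name formats is to ignore the display names and keep only the addresses. The standard library's `email.utils.getaddresses` does this splitting for you. This is a sketch of an alternative to the regular expression, not what the notebook uses, and the header below is made up.

```python
from email.utils import getaddresses

# Made-up header mixing "Last, First" and "First Last" name styles
to_header = '"Example, Ann" <ann@example.edu>, Bob Example <bob@example.edu>'

# getaddresses returns (display name, address) pairs
pairs = getaddresses([to_header])

# Keep only non-empty addresses, lower-cased so duplicates compare equal
addresses = [addr.lower() for name, addr in pairs if addr]
print(addresses)
```

The `if addr` filter also drops the empty entries that a blank header produces.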
    
#### In the cell below I'm doing the following:

* I'm only keeping "To", "From", and "date" information
* I'm going to identify each recipient in the "To" list using a regular expression and make a node/edge relationship (sender to recipient) for each person in the "To" list.
* I'm writing these out to a tab-delimited file


In [19]:

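with open(os.path.join(DATADIR,
            "my_emails_2017.txt"),"wt") as fo:
    for m in msgs.values():
        raw = m[0][1]
        # Most messages decode with the default codec (UTF-8);
        # fall back to windows-1252 for those that do not
        try:
            e = p.parsestr(raw.decode())
        except UnicodeDecodeError:
            e = p.parsestr(raw.decode('windows-1252'))
        # Skip messages with a blank "To" or "From" header
        if e["To"] and e["From"]:
            # One output row per (sender, recipient) pair
            for f,t in product(remail.findall(e["From"]), remail.findall(e["To"])):
                fo.write("%s\t%s\t%s\n"%(f, t, e["date"]))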
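
Once the tab-delimited file exists, it can be read back with `csv` and summarized with a `defaultdict`. The sketch below is an assumption about how the file might be used later, not part of the original analysis. It writes a few made-up rows to a temporary file in the same sender/recipient/date layout and counts the messages on each edge.

```python
import csv
import os
import tempfile
from collections import defaultdict

# Made-up rows in the same sender<TAB>recipient<TAB>date layout
rows = [
    ("ann@example.edu", "bob@example.edu", "Wed, 22 Nov 2017 12:39:49 -0700"),
    ("ann@example.edu", "bob@example.edu", "Thu, 23 Nov 2017 08:00:00 -0700"),
    ("bob@example.edu", "ann@example.edu", "Thu, 23 Nov 2017 09:15:00 -0700"),
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_emails.txt")
    with open(path, "wt", newline="") as fo:
        csv.writer(fo, delimiter="\t").writerows(rows)

    # Count how many messages travel along each sender -> recipient edge
    edge_counts = defaultdict(int)
    with open(path, "rt", newline="") as fi:
        for sender, recipient, date in csv.reader(fi, delimiter="\t"):
            edge_counts[(sender, recipient)] += 1

for edge, n in sorted(edge_counts.items()):
    print(edge, n)
```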