## Message extraction from multiple `.msg` files

This analysis extracts both metadata, such as sender, recipient or date, and information about the body of several emails provided through a freedom of information request in `.msg` format.

In [None]:
# —————— libraries that need to be installed, which you can do via pip ———————

# import pdfplumber # to scrape pdfs, documentation: https://github.com/jsvine/pdfplumber
import pandas as pd # to use pandas to process data
import extract_msg # to extract messages

# —————— libraries built into Python ———————
import csv # to write and read csv
import glob # to access file paths

In [None]:
folder_name = "brookhaven"

paths = glob.glob("../data/"+folder_name+"/*.msg")

In [None]:
paths[0:5]

#### Extract data from messages and structure it
- extract `subject`, `date`, `sender` and `body` using `extract_msg` tools
- clean body (remove unicode symbols such as `\u200c` and `\t`)
- split body text into a list of information that repeats in each message


We will start by looking at one message:

In [None]:
msg = extract_msg.openMsg(paths[0])
print(msg)

Now we can use various attributes from `msg-extractor` to examine the message, including:
- `msg.subject`
- `msg.date`
- `msg.sender`
- `msg.to`

The next two cells test extraction methods, meaning we will try to isolate each part of the message into individual entries.

In [None]:
msg = extract_msg.openMsg(paths[0])

print(
    msg.subject,
    msg.date,
    msg.sender,
    msg.to,
    msg.cc
)

The following cell prints the body text to understand what information is contained in each message.

In [None]:
msg.body

### Cleaning the body text and turning it into data
This part takes the body of the message and structures it into categories that repeat in each message. We do this by splitting the body text into a list of strings at each line break (denoted as `\r\n`).
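
To make the cleaning step concrete, here is a minimal sketch that runs the same replacements and split on a made-up body string. The sample text and the function name `clean_and_split` are assumptions for illustration only; real message bodies will look different.

```python
# Made-up body text for illustration; not taken from the real messages.
sample_body = "Post Titled: Lab update\u200c\r\n\r\n\tView post\r\n  \r\nNews\r\n"


def clean_and_split(body):
    """Strip formatting characters, then split the body on line breaks."""
    cleaned = (
        body.replace("\u200c", "")  # zero-width non-joiner
        .replace("\t", "")          # tabs
        .replace("  ", "")          # runs of two spaces
        .strip()
    )
    # filter(None, ...) drops the empty strings left by consecutive line breaks
    return list(filter(None, cleaned.split("\r\n")))


items = clean_and_split(sample_body)
print(items)
```

Each remaining list entry corresponds to one non-empty line of the body, which is what lets the later cells pick fields by position.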

In [None]:
msg_body_clean = msg.body.replace("\u200c","").replace("\t", "").replace("  ","").strip()
message_items = msg_body_clean.strip().split("\r\n")

In [None]:
list(filter(None, message_items))

In [None]:
message_items_clean  = list(filter(None, message_items))
len(message_items_clean)

In [None]:
{
    "body_title_top"           : message_items_clean[0].replace("Post Titled: ","").strip(),
    "body_link1_title"         : message_items_clean[2].strip(),
    "body_post_classification" : message_items_clean[3].strip(),
    "body_title"               : message_items_clean[4].strip(),
    "body_date"                : message_items_clean[5].strip(),
    "body_description"         : message_items_clean[6].strip(),
    "body_link2_title"         : message_items_clean[7].strip(),
    "body_link3_title"         : message_items_clean[9].strip(),
}

In [None]:
msg_data = []
for path in paths: 
    print(path)
    # open file
    msg = extract_msg.openMsg(path)
    # strip tabs and other formatting from the message body, then split it into a list of items on line breaks
    msg_body_clean = msg.body.replace("\u200c","").replace("\t", "").replace("  ","").strip()
    message_items = msg_body_clean.split("\r\n")
    # drop empty strings left over from consecutive line breaks
    message_items_clean  = list(filter(None, message_items))

    # make a data dictionary that holds all information
    msg_info = {
        "subject"                  : msg.subject,
        "date"                     : msg.date,
        "sender"                   : msg.sender,
        "to"                       : msg.to,
        "cc"                       : msg.cc,
        "body_title_top"           : message_items_clean[0].replace("Post Titled: ","").strip(),
        "body_link1_title"         : message_items_clean[2].strip(),
        "body_post_classification" : message_items_clean[3].strip(),
        "body_title"               : message_items_clean[4].strip(),
        "body_date"                : message_items_clean[5].strip(),
        "body_description"         : message_items_clean[6].strip(),
        "body_link2_title"         : message_items_clean[7].strip(),
        "body_link3_title"         : message_items_clean[9].strip(),
        "body_full"                : msg.body.replace("\u200c","").replace("\r\n","").replace("\t",""),
        "file_name"                : path
    }
    msg_data.append(msg_info)
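
The loop indexes the cleaned list up to position 9, so a message whose body splits into fewer than ten items would raise an `IndexError`. Below is a minimal sketch of one way to guard against that, assuming a missing field should simply be left empty. `item_at` is a hypothetical helper introduced here for illustration; it is not part of the notebook or of `extract_msg`.

```python
def item_at(items, index, default=None):
    """Return items[index] stripped, or default when the list is too short."""
    if 0 <= index < len(items):
        return items[index].strip()
    return default


# Made-up short message for illustration
short_message = ["Post Titled: Lab update", "View post", " News "]

print(item_at(short_message, 2))  # third item, with surrounding spaces removed
print(item_at(short_message, 9))  # position 9 does not exist, so the default is returned
```

With a helper like this, an entry such as `message_items_clean[9].strip()` could be written as `item_at(message_items_clean, 9)`, and short messages would produce empty cells in the final table instead of stopping the loop.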
    


In [None]:
msg_extracts = pd.DataFrame(msg_data)

msg_extracts

In [None]:
msg_extracts.to_csv("../output/msg_extracts.csv", index=False)