## Prerequisite for This Program: An Mbox File with Emails

## How to Export Gmail into an Mbox File

I choose to export only email threads in which I participated to remove spam and general announcements. I also choose to export in chunks of 6 months to keep the file sizes manageable. Here's how I do it:

### Selecting chunks of conversations in Sent

1. Search Gmail for "in:sent after:2019-01-01 before:2019-07-01". The phrase "in:sent" selects only email threads in which I participated. The phrases "after:2019-01-01" and "before:2019-07-01" restrict the results to threads in that date range.

2. Check the checkbox to select all conversations in the search result, then click on "Select all conversations that match this search".

3. Label all these conversations with something like "in-sent-after-2019-01-01-before-2019-07-01"
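
The search string in step 1 and the label in step 3 follow a fixed pattern, so they can be generated for every six-month chunk. The following is a minimal sketch: the helper names `half_year_chunks`, `gmail_query`, and `label_for` are hypothetical, and closing the second half of each year at January 1 of the following year is an assumption (the exported files listed later in this notebook end at 12-31 instead).

```python
from datetime import date

def half_year_chunks(first_year, last_year):
    """Yield (start, end) date pairs covering each half of every year."""
    for year in range(first_year, last_year + 1):
        yield date(year, 1, 1), date(year, 7, 1)
        # Assumption: close the second half at Jan 1 of the next year
        yield date(year, 7, 1), date(year + 1, 1, 1)

def gmail_query(start, end):
    # Same shape as the search string in step 1
    return f"in:sent after:{start.isoformat()} before:{end.isoformat()}"

def label_for(start, end):
    # Same shape as the label in step 3
    return f"in-sent-after-{start.isoformat()}-before-{end.isoformat()}"

for start, end in half_year_chunks(2019, 2019):
    print(gmail_query(start, end), "->", label_for(start, end))
```

The printed queries can be pasted into the Gmail search box one at a time; the labels still have to be applied by hand as in steps 2 and 3.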

### Exporting Gmail (Optionally: only emails with a specific label)

4. Go to [https://takeout.google.com/](https://takeout.google.com/)

5. Click "Deselect all", then find and check "Mail: Messages and attachments in your Gmail account in MBOX format." to only export emails, not all your Google data.

6. Click "All Mail data included", then deselect "Include all messages in Mail", then click "Select all", then "Deselect all". Now, find and check your label "in-sent-after-2019-01-01-before-2019-07-01".

7. Find and click "Next Step" to choose the file type, frequency & destination.

8. Under "Transfer to:", I choose "Add to Drive", but you could also "Send download link via email" or use another option.

9. Click "Create export".

10. Once the export is created, you can click "Open in Drive", then "Open with ZIP Extractor". Extraction will create a .MBOX file. Be mindful of where these files are placed in your Drive; they might wind up buried in "My Drive/Takeout/Mail".

11. You should now have an MBOX file that can be processed with this program!

## Mount Google Drive and list its mbox files

In [41]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
print()

mbox_files = []
mbox_json_files = []
mbox_txt_files = []

import os
for x in os.listdir('/content/drive/MyDrive'):
    if x.endswith('mbox'):
        mbox_files.append(x)
    if 'mbox' in x and x.endswith('json'):
        mbox_json_files.append(x)
    if 'mbox' in x and x.endswith('txt'):
        mbox_txt_files.append(x)

print("Mbox files")
for x in sorted(mbox_files):
    print(x)
print()

print("Mbox json files")
for x in sorted(mbox_json_files):
    print(x)
print()

print("Mbox txt files")
for x in sorted(mbox_txt_files):
    print(x)

Mounted at /content/drive

Mbox files
SMS-2014-2016.mbox
greg-egan-two-emails (1).mbox
in-sent-after-2016-01-01-before-2016-07-01 (1).mbox
in-sent-after-2016-07-01-before-2016-12-31.mbox
in-sent-after-2017-01-01-before-2017-07-01.mbox
in-sent-after-2017-07-01-before-2017-12-31.mbox
in-sent-after-2018-01-01-before-2018-07-01.mbox
in-sent-after-2018-07-01-before-2018-12-31.mbox
in-sent-after-2019-01-01-before-2019-07-01.mbox
in-sent-after-2019-07-01-before-2019-12-31.mbox
in-sent-after-2020-01-01-before-2020-07-01.mbox
in-sent-after-2020-07-01-before-2020-12-31.mbox
in-sent-after-2021-01-01-before-2021-07-01.mbox
in-sent-after-2021-07-01-before-2021-12-31.mbox
in-sent-after-2022-01-01-before-2022-07-01.mbox
in-sent-after-2022-07-01-before-2022-12-31.mbox
in-sent-after-2023-01-01-before-2023-07-01.mbox
in-sent-after-2023-07-01-before-2024-01-12.mbox
isaac-mackey-Sent-20240109-001.mbox

Mbox json files
in-sent-after-2023-07-01-before-2024-01-12.mbox-2.json
in-sent-after-2023-07-01-before-2

## Check a specific file

In [42]:
file_name = "in-sent-after-2016-01-01-before-2016-07-01.mbox"
path = '/content/drive/My Drive'
if file_name in os.listdir(path):
    print("File found")
else:
    print("File not found")

File not found


## Helper functions to read mbox

In [43]:
def print_current_time():
    from datetime import datetime, timedelta
    # Print the runtime's current time shifted back 5 hours
    # (presumably from the runtime's clock to the author's local time)
    print((datetime.now()-timedelta(hours=5)).strftime("%Y-%m-%d %H:%M:%S"))

def UNIX_timestamp_to_formatted_datetime(date):
    # The timestamp is already in seconds; coerce it to an integer
    unix_timestamp = int(date)
    # Relies on the module-level `from datetime import datetime` further down in this cell
    date_time_obj = datetime.utcfromtimestamp(unix_timestamp)
    # Format the datetime object as a string (UTC)
    formatted_date = date_time_obj.strftime('%Y-%m-%d %H:%M:%S')
    return formatted_date

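def getcharsets(msg):
    # Collect every charset declared in the message, skipping parts without one
    return {c for c in msg.get_charsets() if c is not None}

def getBody(msg):
    # Descend into the first part until a non-multipart payload is reached
    while msg.is_multipart():
        msg=msg.get_payload()[0]
    t=msg.get_payload(decode=True)
    # Decode once, with the first declared charset that works
    for charset in getcharsets(msg):
        try:
            return t.decode(charset)
        except UnicodeDecodeError:
            continue
    return t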

def handleerror(errmsg, emailmsg,cs):
    print()
    print(errmsg)
    print("This error occurred while decoding with ",cs," charset.")
    print("These charsets were found in the one email.",getcharsets(emailmsg))
    print("This is the subject:",emailmsg['subject'])
    print("This is the sender:",emailmsg['From'])

def getbodyfromemail(msg):
    body = None
    #Walk through the parts of the email to find the text body.
    if msg.is_multipart():
        for part in msg.walk():

            # If part is multipart, walk through the subparts.
            if part.is_multipart():

                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        # Get the subpart payload (i.e the message body)
                        body = subpart.get_payload(decode=True)
                        #charset = subpart.get_charset()

            # Part isn't multipart so get the email body
            elif part.get_content_type() == 'text/plain':
                body = part.get_payload(decode=True)
                #charset = part.get_charset()

    # If this isn't a multi-part message then get the payload (i.e the message body)
    elif msg.get_content_type() == 'text/plain':
        body = msg.get_payload(decode=True)

    # No checking is done to match each charset with the part it belongs to:
    # try every charset declared in the message and stop at the first success.
    # If none works, the body is returned as undecoded bytes.
    if isinstance(body, bytes):
        for charset in getcharsets(msg):
            try:
                body = body.decode(charset)
                break
            except UnicodeDecodeError:
                continue
                # handleerror("UnicodeDecodeError: encountered.",msg,charset)

        # body = body.decode('utf-8', errors='replace')

    return body

from datetime import datetime
import time
from dateutil import parser

def email_datetime_str_to_unix_timestamp(datetime_str):

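    # Example of the expected input format
    example_datetime_str = "Sat, 16 Dec 2023 18:59:31 -0500"

    # Missing Date header: fall back to a fixed placeholder timestamp
    # (1704826771 is 2024-01-09 18:59:31 UTC)
    if not datetime_str:
        datetime_str = 1704826771

    # Already a Unix timestamp: nothing to parse
    if isinstance(datetime_str, int):
        return datetime_str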

    if datetime_str[-6:] in [" ("+x+")" for x in ["UTC", "PDT", "GMT", "PST", "EST", "EDT", "MDT"]]:
        datetime_str = datetime_str[:-6]

    if datetime_str[-12:] in [" ("+x+")" for x in ["GMT+01:00", "GMT+08:00"]]:
        datetime_str = datetime_str[:-12]

    if datetime_str[3] == ',':
        datetime_str = datetime_str[5:]

    # Parse the date string into a datetime object
    dt = parser.parse(datetime_str)

    # Convert to Unix timestamp
    unix_timestamp = int(time.mktime(dt.timetuple()))

    return unix_timestamp
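
The comment in `getbodyfromemail` notes that charsets are not matched to the part they belong to. One way to close that gap is to ask each part for its own charset with `get_content_charset()`. The sketch below is an alternative, not the notebook's method: the function name `plain_text_body` is hypothetical, and falling back to UTF-8 with replacement characters is an assumption about what is acceptable for the output.

```python
from email.message import EmailMessage

def plain_text_body(msg):
    """Return the first text/plain part as str, or None if there is none."""
    for part in msg.walk():  # walk() already visits nested parts
        if part.get_content_type() != 'text/plain':
            continue
        payload = part.get_payload(decode=True)  # bytes after transfer decoding
        if payload is None:
            continue
        # Use the charset declared on this part; assume UTF-8 when absent
        charset = part.get_content_charset() or 'utf-8'
        try:
            return payload.decode(charset, errors='replace')
        except LookupError:  # charset name unknown to Python
            return payload.decode('utf-8', errors='replace')
    return None

# Small multipart message with a Latin-1 text part and an HTML alternative
demo = EmailMessage()
demo.set_content("Grüße aus Berlin", charset='iso-8859-1')
demo.add_alternative("<p>Gr&uuml;&szlig;e</p>", subtype='html')
print(plain_text_body(demo))
```

Because this always returns `str` or `None`, later steps never have to deal with undecoded bytes.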

## Designate mbox file name to extract

In [191]:
mbox_names = [
    "in-sent-after-2016-07-01-before-2016-12-31.mbox",
    "in-sent-after-2017-01-01-before-2017-07-01.mbox",
    "in-sent-after-2017-07-01-before-2017-12-31.mbox",
    "in-sent-after-2018-01-01-before-2018-07-01.mbox",
    "in-sent-after-2018-07-01-before-2018-12-31.mbox",
    "in-sent-after-2019-01-01-before-2019-07-01.mbox",
    "in-sent-after-2019-07-01-before-2019-12-31.mbox",
    "in-sent-after-2020-01-01-before-2020-07-01.mbox",
    "in-sent-after-2020-07-01-before-2020-12-31.mbox",
    "in-sent-after-2021-01-01-before-2021-07-01.mbox",
    "in-sent-after-2021-07-01-before-2021-12-31.mbox",
    "in-sent-after-2022-01-01-before-2022-07-01.mbox",
    "in-sent-after-2022-07-01-before-2022-12-31.mbox",
    "in-sent-after-2023-01-01-before-2023-07-01.mbox",
    "in-sent-after-2023-07-01-before-2024-01-12.mbox",
    "greg-egan-two-emails.mbox"
]

mbox_name = mbox_names[0]

print(f"mbox_name: {mbox_name}")

# Path to your mbox file
mbox_path = '/content/drive/My Drive/'+mbox_name

import mailbox

# Open the mbox file
mbox = mailbox.mbox(mbox_path)

mbox_name: in-sent-after-2016-07-01-before-2016-12-31.mbox


## Save emails into a simpler data structure

In [179]:
emails = []
thread_ids = []

i = 0
limit = 10000

# Iterate through messages
for message in mbox:
    # print(message)
    # break

    try:
        # Extracting basic headers
        subject = message['subject']
        sender = message['from']
        receiver = message['to']
        date = message['date']
        thread_id = message['X-GM-THRID']

        # Accessing the body of the email
        body_text = getbodyfromemail(message)

        # Skip emails without a text body before measuring or slicing it
        if not body_text:
            continue

        # Truncate long bodies to 4096 characters
        if len(body_text) > 4096:
            body_text = body_text[:4096]

        if body_text[:18] == 'Total SMS messages':
            body_text = 'Total SMS messages'

        if thread_id not in thread_ids:
            thread_ids.append(thread_id)

        json_message = {
            'subject': subject,
            'sender': sender,
            'receiver': receiver,
            'date': email_datetime_str_to_unix_timestamp(date),
            'message': body_text,
            'thread_id': thread_id
        }

        emails.append(json_message)
    except Exception:
        # Skip messages whose headers or body cannot be parsed
        continue
    i += 1
    if i > limit:
        break

# emails = sorted(emails, key=lambda x: x['date'])

print(f"Number of emails: {len(emails)}")
print(f"Number of threads: {len(thread_ids)}")

Number of emails: 1932
Number of threads: 603


## Check that timestamp decoding is working

In [181]:
for i,x in enumerate(emails):
    # Unix timestamp to be converted
    timestamp = emails[i]['date']

    print(timestamp)

    print(UNIX_timestamp_to_formatted_datetime(timestamp))
    print()

    if i > 2:
        break

1656852029
2022-07-03 12:40:29

1656578955
2022-06-30 08:49:15

1695686942
2023-09-26 00:09:02

1645786228
2022-02-25 10:50:28
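
One caveat about the values above: `time.mktime(dt.timetuple())` in `email_datetime_str_to_unix_timestamp` drops the UTC offset and interprets the fields in the runtime's local timezone, so the stored timestamps reflect the sender's wall-clock time rather than a common reference. If offset-aware timestamps are wanted, the standard library's `email.utils.parsedate_to_datetime` can do the parsing. A minimal sketch, with the hypothetical name `date_header_to_unix` and the assumption that headers without a usable offset should be treated as UTC:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def date_header_to_unix(header):
    """Parse an RFC 2822 Date header into a UTC-based Unix timestamp."""
    dt = parsedate_to_datetime(header)
    if dt.tzinfo is None:
        # Assumption: no usable offset in the header means UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

ts = date_header_to_unix("Sat, 16 Dec 2023 18:59:31 -0500")
print(ts)
print(datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S'))
```

Switching to this would change the stored timestamps, so the recorded outputs in this notebook would no longer match.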



## Inspect email data structure

In [None]:
for x in emails[:3]:
    print(x)

## Clean up sender, receiver, and message

In [183]:
good_emails = []
receiver_names = []
sender_names = []

for e in emails:
    try:
        subject = e['subject']
        sender = e['sender']
        receiver = e['receiver']
        date = e['date']
        message = e['message']
        thread_id = e['thread_id']

        # print(subject)
        # print(sender)
        # print(receiver)
        # print(date)
        # print(message)
        # print()

        if not receiver:
            continue

        sender_name = sender.split('<')[0].strip()
        sender_names.append(sender_name)

        # Display name taken from the start of the To header, used as the record's receiver
        receiver_name = receiver.split('<')[0].strip()

        # Collect the display name (or bare address) of every recipient
        email_receiver_names = []
        for n in receiver.split(','):
            name = n.split('<')[0].strip() if '<' in n else n.strip()
            receiver_names.append(name)
            email_receiver_names.append(name)

        if message:
            if '\nOn' in message:
                message = message.split('\nOn')[0].strip()

            email_normalized = message.replace('\r\n', '\n')
            lines = email_normalized.split('\n')

            # Check if the last line contains the first or last name to remove sender signature
            # (kept in a separate variable so the sender_names list above is not overwritten)
            sender_name_parts = [p for p in sender_name.split(' ') if p]

            last_line = lines[-1] if lines else ""

            while not last_line and lines:
                lines = lines[:-1]  # Remove the last line
                last_line = lines[-1] if lines else ""

            if any(x in last_line for x in sender_name_parts) and lines:
                lines = lines[:-1]  # Remove the last line

                last_line = lines[-1] if lines else ""

                # Check if the last line contains one of the following to remove the parting words
                parting_words = ["Best", "Thanks", "Sincerely", "Regards", "Warm regards",
                "Kind regards", "Best regards", "Respectfully", "Cheers", "Take care",
                "All the best", "Many thanks"]

                if any([x in last_line for x in parting_words]) and lines:
                    lines = lines[:-1]  # Remove the last line

            message = ' '.join(lines)

            message = message.replace('  ',' ')

            json_message = {
              'subject': subject,
              'sender': sender_name,
              'receiver': receiver_name,
              'date': date,
              'message': message,
              'thread_id': thread_id
            }

            good_emails.append(json_message)

    except Exception as x:
        # Print the offending email and the error, then move on
        print(e)
        print(x)
        print()

In [184]:
print(f"Number of emails: {len(emails)}")
print(f"Number of good emails: {len(good_emails)}")

Number of emails: 1932
Number of good emails: 1931
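
Splitting the To header on `,` and `<` works for simple addresses but breaks when a quoted display name contains a comma. The standard library's `email.utils.getaddresses` handles that case. A minimal sketch, with the hypothetical helper name `display_names` and example addresses that are made up:

```python
from email.utils import getaddresses

def display_names(header_value):
    """Return a display name (or the bare address) for every address in a header."""
    return [name or addr for name, addr in getaddresses([header_value])]

print(display_names('"Doe, Jane" <jane@example.com>, bob@example.com'))
```

The same helper can be applied to the From header to obtain the sender's name.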


## Group into Threads

In [185]:
threads = []

for t in thread_ids:
    thread = []
    for e in good_emails:
        if e['thread_id'] == t:
            thread.append(e)
    thread = sorted(thread, key=lambda x: x['date'])
    if thread:
        threads.append(thread)
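
The nested loop above scans every email once per thread ID. For larger mailboxes the same grouping can be done in a single pass with a dictionary. A minimal sketch, assuming each email is a dict with `thread_id` and `date` keys as built earlier; the function name `group_by_thread` is hypothetical:

```python
from collections import defaultdict

def group_by_thread(emails):
    """Group email dicts by 'thread_id', each thread sorted by 'date'."""
    by_thread = defaultdict(list)
    for e in emails:
        by_thread[e['thread_id']].append(e)
    # Dicts keep insertion order, so threads come out in first-seen order
    return [sorted(msgs, key=lambda m: m['date']) for msgs in by_thread.values()]

sample = [
    {'thread_id': 'a', 'date': 20, 'message': 'reply'},
    {'thread_id': 'b', 'date': 15, 'message': 'other'},
    {'thread_id': 'a', 'date': 10, 'message': 'first'},
]
for thread in group_by_thread(sample):
    print([m['message'] for m in thread])
```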

## Check correct grouping of threads

In [None]:
i = 0
limit = 3
for t in threads[:10]:
    if len(t) < 3:
        continue
    print(t[0]['subject'])
    print(UNIX_timestamp_to_formatted_datetime(t[0]['date']))
    for e in t:
        print('sender:',e['sender'],',',e['message'])
        print()
    print()

## Helper Functions to Write Emails into TXT File

In [187]:
from datetime import datetime

def UNIX_timestamp_to_formatted_datetime(date):
    unix_timestamp = int(date)  # Convert to integer and then to seconds
    date_time_obj = datetime.utcfromtimestamp(unix_timestamp)
    # Format the datetime object as a string
    formatted_date = date_time_obj.strftime('%Y-%m-%d %H:%M:%S')
    return formatted_date

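def write_threads_to_txt_file(output_file_path, threads):
    if not output_file_path.endswith('.txt'):
        print("txt not in",output_file_path)
        return

    # The summary below reads threads[0][0], so an empty list cannot be written
    if not threads:
        print("No threads to write")
        return

    # Writing the output to a text file
    with open(output_file_path, 'w', encoding='utf-8') as file:
        print("writing to output_file_path:", output_file_path)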

        num_emails = 0
        contacts = []
        all_thread_contacts = []
        earliest = threads[0][0]['date']
        latest = threads[0][0]['date']

        for t in threads:
            thread_contacts = []
            for e in t:
                contacts.append(e['sender'])
                contacts.append(e['receiver'])
                thread_contacts.append(e['sender'])
                thread_contacts.append(e['receiver'])
                num_emails += 1
                message_date = e['date']
                if e['date'] < earliest:
                    earliest = message_date
                if e['date'] > latest:
                    latest = message_date
            all_thread_contacts.append(list(set(thread_contacts)))

        contacts = list(set(contacts))

        file.write('Total emails: ' + str(num_emails) + '\n')
        file.write('Contacts found: ' + str(len(contacts)) + '\n')
        file.write('Earliest message: ' + UNIX_timestamp_to_formatted_datetime(earliest) + '\n')
        file.write('Latest message: ' + UNIX_timestamp_to_formatted_datetime(latest) + '\n')

        file.write('\n')

        for i,t in enumerate(threads):
            thread_contacts = all_thread_contacts[i]
            max_length = max(len(x) for x in thread_contacts)

            file.write('Conversation with ' + ', '.join(thread_contacts) + '\n')
            padded_contact_names = {x: x.ljust(max_length) for x in thread_contacts}

            file.write(str(len(t)) + " emails" + '\n')

            date_sorted_emails = sorted(t, key=lambda x: x['date'])

            earliest = date_sorted_emails[0]['date']
            latest = date_sorted_emails[-1]['date']

            file.write('Earliest message: ' + UNIX_timestamp_to_formatted_datetime(earliest) + '\n')
            file.write('Latest message: ' + UNIX_timestamp_to_formatted_datetime(latest) + '\n')

            # Iterate over messages
            for e in t:
                # Format the datetime object as a string
                formatted_date = UNIX_timestamp_to_formatted_datetime(e['date'])
                file.write(formatted_date + ": " + padded_contact_names[e['sender']] + ": "+e['message'] + '\n')
            file.write('\n')

    print('File closed')

## Check target mbox name

In [188]:
mbox_name

'in-sent-after-2022-01-01-before-2022-07-01.mbox'

## Write Emails into TXT File

In [189]:
print_current_time()

txt_suffix = '-test-2.txt'
txt_name = mbox_name + txt_suffix
txt_path = mbox_path + txt_suffix

print('txt_path:',txt_path)

# Print the number of emails
print('Total emails:', len(good_emails))

if txt_name in os.listdir('/content/drive/My Drive'):
    print("File already exists")
else:
    write_threads_to_txt_file(txt_path,threads)

2024-05-10 19:44:04
txt_path: /content/drive/My Drive/in-sent-after-2022-01-01-before-2022-07-01.mbox-test-2.txt
Total emails: 1931
writing to output_file_path: /content/drive/My Drive/in-sent-after-2022-01-01-before-2022-07-01.mbox-test-2.txt
File closed


## Helper Functions to Create Role/System/User/Assistant JSON files

In [39]:
import json

def write_threads_to_role_system_user_format_json(output_file_path, threads):
    if not output_file_path.endswith('.json'):
        print("json not in",output_file_path)
        return

    # Writing the output as JSON Lines: one conversation per line
    with open(output_file_path, 'w', encoding='utf-8') as file:
        print("writing to output_file_path:", output_file_path)

        for t in threads:
            if len(t) < 2:
                continue

            conversation = []

            system_message = ("Be polite and formal. Do not apologize. Use correct grammar and avoid logic fallacies.")
            conversation.append({"role": "system", "content": system_message})

            assistant_present = False

            # Iterate over messages
            for e in t:
                sender_name = e['sender']

                if "isaac" in e['sender'].lower():
                    role = "assistant"
                    assistant_present = True
                else:
                    role = "user"
                conversation.append({"role": role, "content": e['message']})

                # conversation.append({"role": "separator", "content": "<END_OF_CONVERSATION>"})
            if assistant_present:
                json_record = json.dumps({'messages': conversation})
                file.write(json_record + '\n')

## Create Role/System/User/Assistant JSON file

In [None]:
print_current_time()

json_suffix = '-5.json'
json_file = mbox_name + json_suffix
json_path = mbox_path + json_suffix

print('json_path:',json_path)

# Print the number of emails
print('Total emails:', len(good_emails))

if json_file in os.listdir('/content/drive/My Drive'):
    print("File already exists")
else:
    write_threads_to_role_system_user_format_json(json_path,threads)

2024-01-15 20:06:43
json_path: /content/drive/My Drive/in-sent-after-2023-07-01-before-2024-01-12.mbox-5.json
Total emails: 1291
writing to output_file_path: /content/drive/My Drive/in-sent-after-2023-07-01-before-2024-01-12.mbox-5.json
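
Since the writer emits one JSON record per line, a quick sanity check is to read the lines back and count the roles. A minimal sketch, assuming records shaped like `{'messages': [{'role': ..., 'content': ...}]}` as produced above; the function name `count_roles` and the sample record are made up for illustration:

```python
import json
from collections import Counter

def count_roles(lines):
    """Count message roles across JSON Lines records shaped like {'messages': [...]}."""
    roles = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        roles.update(m['role'] for m in record['messages'])
    return roles

sample_lines = [
    json.dumps({'messages': [
        {'role': 'system', 'content': 'Be polite and formal.'},
        {'role': 'user', 'content': 'Hello'},
        {'role': 'assistant', 'content': 'Hi'},
    ]}) + '\n',
]
print(count_roles(sample_lines))
```

An open file handle can be passed in place of `sample_lines`, since iterating over a file yields its lines.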


## Dump threads into raw json file for easy recovery

In [None]:
print_current_time()

import json

# File path where JSON data is saved
raw_json_suffix = "-raw-2.json"

raw_json_file = mbox_name + raw_json_suffix
raw_json_path = mbox_path + raw_json_suffix

print('raw_json_path:',raw_json_path)

if raw_json_file in os.listdir('/content/drive/My Drive'):
    print("File already exists")
else:
    with open(raw_json_path, 'w') as file:
        json.dump(threads, file)

2024-01-15 20:00:16
raw_json_path: /content/drive/My Drive/in-sent-after-2023-07-01-before-2024-01-12.mbox-raw-2.json


In [None]:
raw_json_file = mbox_name + "-raw-1.json"
raw_json_path = mbox_path + "-raw-1.json"

print('raw_json_file:',raw_json_file)

if not raw_json_file in os.listdir('/content/drive/My Drive'):
    print("File doesn't exist")
else:
    print("File found")
    with open(raw_json_path, 'r') as file:
        threads_2 = json.load(file)

raw_json_file: in-sent-after-2023-07-01-before-2024-01-12.mbox-raw-1.json


In [None]:
print(len(threads))
print(len(threads_2))

79
79
