# Analyze and Visualize your Gmail Inbox with Nomic Atlas

Nomic Atlas is a platform that makes it easy to analyze unstructured data.

In this notebook, we upload data from a Gmail inbox to Atlas to summarize and analyze thousands of emails. With the [Atlas Pro plan](https://atlas.nomic.ai/pricing), you can store and analyze private datasets, so you get all of the Atlas features while keeping your data private.

## Download Gmail Inbox

We recommend exporting your Gmail inbox with [Google Takeout](https://takeout.google.com/). Takeout exports your mail in a format that this notebook preprocesses and sends to Atlas.

To download your inbox, go to [Google Takeout](https://takeout.google.com/), create a new export for Mail, and choose how the export should be delivered (for example, as a download link sent by email or as a file added to Google Drive).

You should receive a `.zip` or `.tgz` file which, when extracted, contains an `.mbox` file. We parse that file into a pandas DataFrame in the next section.

## Parse Gmail Inbox from `.mbox` to Pandas DataFrame

In [14]:
!pip install -q beautifulsoup4 pandas
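
Before parsing, the Takeout archive has to be extracted and the `.mbox` file located. The following is a minimal sketch using only the standard library. `extract_takeout` and its arguments are hypothetical names introduced here, and no particular folder layout inside the archive is assumed, so the function simply searches the extracted tree for `.mbox` files.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_takeout(archive_path, dest_dir):
    """Extract a .zip or .tgz archive and return the .mbox files found inside.

    Hypothetical helper: intended for your own, trusted Takeout export.
    """
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(dest_dir)
    elif tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path) as tf:
            if hasattr(tarfile, "data_filter"):
                # Newer Python versions support extraction filters
                tf.extractall(dest_dir, filter="data")
            else:
                tf.extractall(dest_dir)
    else:
        raise ValueError(f"Not a .zip or .tgz archive: {archive_path}")

    # Search recursively instead of assuming a fixed folder structure
    return sorted(dest_dir.rglob("*.mbox"))
```

Calling `extract_takeout("takeout.zip", "takeout_extracted")` (both paths are placeholders) returns a list of `Path` objects, and the first entry can be used as the `.mbox` path in the cells below.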

In [45]:
from bs4 import BeautifulSoup
from datetime import datetime
import email.utils  # parsedate_to_datetime lives in this submodule
from email.header import decode_header
import mailbox
import pandas as pd
import re

def decode_field(field):
    """Decode email header field."""
    if field is None:
        return ""
    decoded_parts = []
    for part, encoding in decode_header(field):
        if isinstance(part, bytes):
            if encoding:
                try:
                    decoded_parts.append(part.decode(encoding))
                except (LookupError, UnicodeDecodeError):
                    decoded_parts.append(part.decode('utf-8', errors='replace'))
            else:
                decoded_parts.append(part.decode('utf-8', errors='replace'))
        else:
            decoded_parts.append(part)
    # decode_header keeps the whitespace between parts, so join without adding more
    return ''.join(decoded_parts)

def extract_body(message):
    """Extract the body from the email message."""
    body = ""
    if message.is_multipart():
        for part in message.walk():
            content_type = part.get_content_type()
            content_disposition = str(part.get("Content-Disposition"))
            if "attachment" in content_disposition:
                continue
            if content_type == "text/plain":
                try:
                    payload = part.get_payload(decode=True)
                    charset = part.get_content_charset() or 'utf-8'
                    body = payload.decode(charset, errors='replace')
                    break  # Use first text/plain part
                except Exception:
                    continue
            elif content_type == "text/html" and not body:
                try:
                    payload = part.get_payload(decode=True)
                    charset = part.get_content_charset() or 'utf-8'
                    html_body = payload.decode(charset, errors='replace')
                    soup = BeautifulSoup(html_body, 'html.parser')
                    body = soup.get_text(separator=' ', strip=True)
                except Exception:
                    continue
    else:
        content_type = message.get_content_type()
        if content_type == "text/plain":
            try:
                payload = message.get_payload(decode=True)
                charset = message.get_content_charset() or 'utf-8'
                body = payload.decode(charset, errors='replace')
            except Exception:
                body = message.get_payload()
        elif content_type == "text/html":
            try:
                payload = message.get_payload(decode=True)
                charset = message.get_content_charset() or 'utf-8'
                html_body = payload.decode(charset, errors='replace')
                soup = BeautifulSoup(html_body, 'html.parser')
                body = soup.get_text(separator=' ', strip=True)
            except Exception:
                body = message.get_payload()
    return body

def parse_mbox(mbox_file):
    """Parse mbox file and return a DataFrame of emails."""
    data = []
    mbox = mailbox.mbox(mbox_file)
    for message in mbox:
        subject = decode_field(message['subject'])
        from_address = decode_field(message['from'])
        to_address = decode_field(message['to'])
        date_str = message['date']
        date = None
        if date_str:
            try:
                date = email.utils.parsedate_to_datetime(date_str).isoformat()
            except Exception:
                pass
        if date is None and message['received']:
            # Fall back to the timestamp in the Received header when the
            # Date header is missing or cannot be parsed
            try:
                received = message['received']
                date_match = re.search(r'\d+\s+\w+\s+\d{4}\s+\d{2}:\d{2}:\d{2}', received)
                if date_match:
                    date = datetime.strptime(date_match.group(0), '%d %b %Y %H:%M:%S').isoformat()
            except Exception:
                pass
        message_id = message['message-id']
        thread_id = message.get('X-GM-THRID', None)
        if not thread_id: # As a fallback, use References or In-Reply-To
            thread_id = message.get('References', message.get('In-Reply-To', message_id))
        body = extract_body(message)
        labels_str = message.get('X-Gmail-Labels', '')
        label_dict = {
            'Inbox': False, 
            'Important': False, 
            'Opened': False, 
            'Unread': False, 
            'Archived': False, 
            'Trash': False, 
            'Spam': False,
            'Category_Updates': False,
            'Category_Personal': False,
            'Category_Promotions': False,
            'Category_Forums': False,
            'Category_Purchases': False,
            'IMAP_Forwarded': False
        }
        if labels_str:
            labels = labels_str.split(',')
            for label in labels:
                label = label.strip()
                if 'Category' in label:
                    category = label.replace('Category ', 'Category_').strip()
                    if category in label_dict:
                        label_dict[category] = True
                elif 'IMAP_$Forwarded' in label or 'IMAP_Forwarded' in label:
                    label_dict['IMAP_Forwarded'] = True
                elif label in label_dict:
                    label_dict[label] = True
        data.append({
            'message_id': message_id,
            'thread_id': thread_id,
            'date': date,
            'from': from_address,
            'to': to_address,
            'subject': subject,
            'body': body,
            **label_dict
        })
    return pd.DataFrame(data)
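
To see which raw values the parser works with, it can help to round-trip a tiny synthetic mailbox. The sketch below writes a one-message `.mbox` with the standard library and reads back the headers that `parse_mbox` relies on. All addresses, IDs, and labels are made up, and `write_sample_mbox`, `decode_subject`, and `read_headers` are hypothetical helpers, not part of the notebook's parser.

```python
import mailbox
import os
import tempfile
from email.header import decode_header
from email.message import EmailMessage


def write_sample_mbox(path):
    """Write a one-message mbox whose headers mimic a Takeout export (values are made up)."""
    msg = EmailMessage()
    msg["Subject"] = "Café meeting notes"  # non-ASCII, so it is RFC 2047-encoded on disk
    msg["From"] = "Alice <alice@example.com>"
    msg["To"] = "bob@example.com"
    msg["Date"] = "Tue, 08 Apr 2025 23:22:36 +0000"
    msg["Message-ID"] = "<sample-1@example.com>"
    msg["X-GM-THRID"] = "1234567890"
    msg["X-Gmail-Labels"] = "Inbox,Important,Category Updates"
    msg.set_content("Hello from a synthetic message.")
    box = mailbox.mbox(path)
    try:
        box.add(msg)
        box.flush()
    finally:
        box.close()


def decode_subject(raw):
    """Decode an RFC 2047 header value into plain text."""
    parts = []
    for part, encoding in decode_header(raw):
        if isinstance(part, bytes):
            parts.append(part.decode(encoding or "utf-8", errors="replace"))
        else:
            parts.append(part)
    return "".join(parts)


def read_headers(path):
    """Return the raw and decoded header values for every message in the mbox."""
    box = mailbox.mbox(path)
    try:
        return [
            {
                "raw_subject": message["subject"],
                "subject": decode_subject(message["subject"]),
                "thread_id": message.get("X-GM-THRID"),
                "labels": [l.strip() for l in message.get("X-Gmail-Labels", "").split(",")],
            }
            for message in box
        ]
    finally:
        box.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample_path = os.path.join(tmp, "sample.mbox")
        write_sample_mbox(sample_path)
        for row in read_headers(sample_path):
            print(row)
```

The raw subject comes back in its encoded on-disk form, which is why the parser decodes header fields before storing them.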

In [46]:
mbox_filepath = "/Users/max/Downloads/Takeout/Mail/gmail.mbox"  # replace with the path to your own .mbox file
email_df = parse_mbox(mbox_filepath)
for c in email_df.columns:
    if email_df[c].dtype == bool:
        email_df[c] = email_df[c].astype(str)
email_df["id"] = list(range(len(email_df)))
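
`parse_mbox` tracks a fixed set of labels, so any other label in your export (for example, a custom one) does not get its own column. As a quick check of which labels actually occur, the sketch below tallies the raw `X-Gmail-Labels` header values. `tally_labels` and `labels_in_mbox` are hypothetical helpers, and the sketch assumes the header is a plain comma-separated string, as the parser above does.

```python
import mailbox
from collections import Counter


def tally_labels(label_headers):
    """Count labels across raw X-Gmail-Labels values (comma-separated strings)."""
    counts = Counter()
    for header in label_headers:
        if not header:
            continue  # message has no X-Gmail-Labels header
        for label in str(header).split(","):
            label = label.strip()
            if label:
                counts[label] += 1
    return counts


def labels_in_mbox(mbox_file):
    """Tally the labels of every message in an mbox file."""
    box = mailbox.mbox(mbox_file)
    try:
        return tally_labels(message.get("X-Gmail-Labels") for message in box)
    finally:
        box.close()
```

Running `labels_in_mbox(mbox_filepath).most_common()` lists labels by frequency, which makes it easy to spot labels worth adding to `label_dict`.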

In [47]:
email_df

|  | message_id | thread_id | date | from | to | subject | body | Inbox | Important | Opened | ... | Archived | Trash | Spam | Category_Updates | Category_Personal | Category_Promotions | Category_Forums | Category_Purchases | IMAP_Forwarded | id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | <010101961f92e1b9-42e7a208-2a78-44c3-8036-549d... | 1829016900838009163 | 2025-04-10T12:00:41+00:00 | Jacob Portes <usr-PKQmMdZV0mWW3A7@user.luma-ma... | max@nomic.ai | ⏰ 🥯+🤖 AI Bagels Biotech Edition (with Bits in ... | 🥯+🤖 AI Bagels Biotech Edition (with Bits in Bi... | True | True | True | ... | False | False | False | False | False | True | False | False | False | 0 |
| 1 | <01000196086f6965-91734e84-326a-4776-ba72-9451... | 1828609843013203201 | 2025-04-06T00:10:41+00:00 | notifications <discussions_watched@notificatio... | max@nomic.ai | datasets/nomic-ai/VisRAG-Ret-Train-Synthetic-d... | New Discussion by Parquet-converter (BOT) (@pa... | False | False | False | ... | False | False | True | True | False | False | False | False | False | 1 |
| 2 | <mid-01JRBVBZBFXNMV48ETAMHHG279@k3.send.voyage... | 1828878611160173302 | 2025-04-08T23:22:36+00:00 | "Voyage AI" <t@voyageai.com> | <max@nomic.ai> | Voyage AI Apologies for Recent Outages | Dear Voyage AI users,\r\n\r\nWe want to extend... | True | True | True | ... | False | False | False | True | False | False | False | False | False | 2 |
| 3 | <calendar-0b0a5627-10b7-4e15-a8c6-03e9afeac5fd... | 1828764947622617319 | 2025-04-07T17:16:00+00:00 | Wilson Lin <wilson.lin@nomic.ai> | Max Cembalest <max@nomic.ai>, Sam Gildea <sam.... | Invitation: Analyst v2 sync @ Mon Apr 7, 2025 ... | Analyst v2 sync\r\nMonday Apr 7, 2025 ⋅ 4pm – ... | True | True | True | ... | False | False | False | False | True | False | False | False | False | 3 |
| 4 | <calendar-8628d39d-96d0-4e67-9fd0-60de72dc68dd... | 1828223344248063511 | 2025-04-01T17:47:27+00:00 | Brandon Duderstadt <brandon@nomic.ai> | Max Cembalest <max@nomic.ai>, Zach Nussbaum <z... | Invitation: model launch @ Wed Apr 2, 2025 11:... | model launch\r\nWednesday Apr 2, 2025 ⋅ 11:45a... | True | True | True | ... | False | False | False | False | True | False | False | False | False | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2487 | <0100019432075ac4-d090ee14-35fb-4c4a-b48f-60f3... | 1820334364614289913 | 2025-01-04T15:55:30+00:00 | notifications <discussions_watched@notificatio... | max@nomic.ai | nomic-ai/modernbert-embed-base (#6) - Improve ... | New Pull Request by Joshua (@Xenova): #6 - Imp... | True | True | True | ... | False | False | False | True | False | False | False | False | False | 2487 |
| 2488 | <1740520989881.26573a4a-06f4-464a-b18b-6be975b... | 1825068578700455387 | 2025-02-25T14:03:10-08:00 | Sebastian Lerner <sebastian@circleci.com> | max@nomic.ai | 🛠 CircleCI's latest highlights: every pipeline... | Discover CircleCI's latest features for seamle... | False | False | True | ... | True | False | False | True | False | False | False | False | False | 2488 |
| 2489 | <hme6LuKhR2GB4zP91k4YKA@geopod-ismtpd-10> | 1818897433973778348 | 2024-12-19T19:10:49+00:00 | Midjourney <feedback@midjourney.com> | max@nomic.ai | Midjourney Year in Review | New features and 30% off plans until the end\r... | False | False | True | ... | True | False | False | True | False | False | False | False | False | 2489 |
| 2490 | <182c174ee44c3a4956ea850abd022018@broadcasts.l... | 1823323000262094146 | 2025-02-06T15:38:33+00:00 | Screen Studio <screenstudio@broadcasts.lemonsq... | max@nomic.ai | Screen Studio 3.0 has launched on Product Hunt! | Hey!\r\n\r\nWe've just launched Screen Studio ... | False | False | True | ... | True | False | False | False | False | True | False | False | False | 2490 |

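The `date` column keeps whatever UTC offset each message carried, which is why the output above mixes `+00:00` with `-08:00` (row 2488). If you plan to sort or filter by time, one option is to normalize every timestamp to UTC before uploading. This is a minimal sketch, not part of the original parser: `to_utc_iso` is a hypothetical helper, and treating dates without a usable offset as UTC is an assumption.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def to_utc_iso(date_header):
    """Convert an RFC 2822 Date header to an ISO 8601 string in UTC, or None."""
    if not date_header:
        return None
    try:
        parsed = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return None  # unparseable date
    if parsed.tzinfo is None:
        # A "-0000" offset yields a naive datetime; assume it is UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


print(to_utc_iso("Tue, 25 Feb 2025 14:03:10 -0800"))  # 2025-02-25T22:03:10+00:00
```

With every value in the same offset, the ISO strings also sort chronologically as plain text.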

## Upload Data to Nomic Atlas

Once you have your emails in a DataFrame with the features you want to include, you can create a new dataset in Atlas and upload your data to the platform for visualization and analysis. Make sure you have the Nomic Python SDK installed and that you log in with your [Nomic API Key](https://atlas.nomic.ai/cli-login).

In [48]:
!pip install -q nomic

In [49]:
!nomic login nk-...

In [53]:
from nomic import AtlasDataset

dataset_identifier = "gmail-inbox" # to create the dataset in the organization connected to your Nomic API key
# dataset_identifier = "<ORG_NAME>/gmail-inbox" # to create the dataset in other organizations you are a member of

atlas_dataset = AtlasDataset(dataset_identifier, unique_id_field="id")

2025-04-10 15:42:16.498 | INFO     | nomic.dataset:_create_project:867 - Organization name: `nomic`
2025-04-10 15:42:17.059 | INFO     | nomic.dataset:_create_project:895 - Creating dataset `gmail-inbox`


In [54]:
add_data_result = atlas_dataset.add_data(email_df)

1it [00:01,  1.36s/it]
2025-04-10 15:42:21.036 | INFO     | nomic.dataset:_add_data:1714 - Upload succeeded.


The next cell sets `body` as the indexed field, that is, the column whose text Atlas embeds to build your data map. Emails with similar text in `body` will therefore cluster together on the map.

In [56]:
create_index_result = atlas_dataset.create_index(indexed_field="body")

2025-04-10 15:42:33.775 | INFO     | nomic.dataset:create_index:1301 - Created map `01962139-b798-06a6-0915-2ca12d51dcf0` in dataset `nomic/gmail-inbox`: https://atlas.nomic.ai/data/nomic/gmail-inbox
