# Message Search
Searching iMessages on a Mac or iPhone is slow, taking 5-30 seconds depending on what you're looking for. Once you've searched, cycling through results is unintuitive (`command-G` and `command-shift-G`, for reference), and seeing all results in one view is not possible. Using Python and Elasticsearch, we can solve both of these issues.

Let's start by getting all of the imports out of the way, which also gives a little preview of things to come.

In [1]:
from __future__ import print_function
import datetime
import sqlite3
from collections import namedtuple
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from tqdm import tqdm

## Working with the message database

Messages on a Mac are stored in a file called `chat.db`, which is located in `~/Library/Messages/` by default. To ensure nothing we do modifies this file, copy it. I'll be working on a duplicate stored in my current working directory.
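
One way to make that copy from Python is sketched below. The helper name `copy_database` is hypothetical, and the source path in the usage line is just the default location mentioned above, so adjust it if your database lives elsewhere.

```python
import os
import shutil

def copy_database(source, destination='chat.db'):
    """Copy the messages database so the original is never opened directly."""
    source = os.path.expanduser(source)  # expand a leading ~ to the home directory
    # copy2 copies the file contents and also tries to preserve file metadata
    return shutil.copy2(source, destination)

# Usage (left commented out so nothing is copied unintentionally):
# copy_database('~/Library/Messages/chat.db')
```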

In [2]:
PATH_TO_DB = 'chat.db' # file path to local copy of messages database

This is a SQLite database, for which Python conveniently has built-in support through the `sqlite3` module. Let's connect and get back a cursor so we can start exploring the database.

In [3]:
def connect(path_to_db):
    conn = sqlite3.connect(path_to_db)
    cursor = conn.cursor()
    return cursor

c = connect(PATH_TO_DB)
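
As an extra safeguard on top of working from a copy, the connection itself can be opened read-only. This is a minimal sketch, assuming Python 3.4 or newer (where `sqlite3.connect` accepts `uri=True`); the helper name `connect_read_only` and the throwaway demo database are hypothetical and not part of the notebook above.

```python
import os
import pathlib
import sqlite3
import tempfile

def connect_read_only(path_to_db):
    """Open a SQLite file read-only through a URI (requires Python 3.4+)."""
    # as_uri() yields a properly escaped file: URI; mode=ro forbids writes
    uri = pathlib.Path(path_to_db).resolve().as_uri() + '?mode=ro'
    return sqlite3.connect(uri, uri=True)

# Demonstrate on a throwaway database rather than the real chat.db
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.db')
writer = sqlite3.connect(demo_path)
writer.execute("CREATE TABLE message (text TEXT)")
writer.commit()
writer.close()

reader = connect_read_only(demo_path)
try:
    reader.execute("INSERT INTO message VALUES ('hello')")
    write_blocked = False
except sqlite3.OperationalError:
    # SQLite refuses to write through a read-only connection
    write_blocked = True
finally:
    reader.close()

print(write_blocked)
```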

Digging into the tables first:

In [4]:
def tables(cursor):
    # sqlite_master lists every object in the database; keep only the tables
    rows = cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    return [row[0] for row in rows if not row[0].startswith('_')]

print(tables(c))

['chat', 'sqlite_sequence', 'attachment', 'handle', 'chat_handle_join', 'message', 'chat_message_join', 'message_attachment_join', 'deleted_messages']


The `message` table looks promising, as does `handle`. Let's look at the schema of each of these tables:

In [5]:
def schema(cursor, table_name):
    # the tuples returned by table_info have the name of the field in the 2nd position, index 1
    return [s[1] for s in cursor.execute("PRAGMA table_info('{}')".format(table_name))]

print('*'*10+'Message Schema'+'*'*10)
print(schema(c, 'message'))
print('*'*10+'Handle Schema'+'*'*10)
print(schema(c, 'handle'))

**********Message Schema**********
['ROWID', 'guid', 'text', 'replace', 'service_center', 'handle_id', 'subject', 'country', 'attributedBody', 'version', 'type', 'service', 'account', 'account_guid', 'error', 'date', 'date_read', 'date_delivered', 'is_delivered', 'is_finished', 'is_emote', 'is_from_me', 'is_empty', 'is_delayed', 'is_auto_reply', 'is_prepared', 'is_read', 'is_system_message', 'is_sent', 'has_dd_results', 'is_service_message', 'is_forward', 'was_downgraded', 'is_archive', 'cache_has_attachments', 'cache_roomnames', 'was_data_detected', 'was_deduplicated', 'is_audio_message', 'is_played', 'date_played', 'item_type', 'other_handle', 'group_title', 'group_action_type', 'share_status', 'share_direction', 'is_expirable', 'expire_state', 'message_action_type', 'message_source', 'associated_message_guid', 'balloon_bundle_id', 'payload_data', 'associated_message_type', 'expressive_send_style_id', 'associated_message_range_location', 'associated_message_range_length', 'time_expre

From `message`, it looks like `text`, `handle_id`, and `date` will be useful. Likewise, from `handle` we really care about the `ROWID` (which is what `handle_id` in the message table refers to) and the `id`. Together these will allow us to look up an actual identification for a contact.
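
To make the relationship between the two tables concrete, here is a small sketch that expresses it as a SQL join. It runs against a toy in-memory database whose tables and rows are made up for illustration and only mimic the few columns discussed above; it is not the real `chat.db` schema.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE handle (ROWID INTEGER PRIMARY KEY, id TEXT);
    CREATE TABLE message (ROWID INTEGER PRIMARY KEY, text TEXT, handle_id INTEGER, date INTEGER);
    INSERT INTO handle (ROWID, id) VALUES (1, '+15550100'), (2, 'friend@example.com');
    INSERT INTO message (text, handle_id, date) VALUES
        ('hello', 1, 0), ('lunch?', 2, 60), ('orphan', 7, 120);
""")

# LEFT JOIN keeps messages whose handle_id has no matching handle row
rows = conn.execute("""
    SELECT message.text, handle.id
    FROM message
    LEFT JOIN handle ON message.handle_id = handle.ROWID
    ORDER BY message.date
""").fetchall()

for text, contact in rows:
    print(text, contact)
```

The third toy message points at a handle that does not exist, so its contact comes back as `None`. A plain inner join would silently drop it.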

## Transforming the message database to Python classes
At this point, we have a lot of data that we are going to want to associate by name. One approach would be to create a `Message` class with properties for `text`, `handle_id`, and `date`, and a `Handle` class with properties for `ROWID` and `id`. Namedtuples provide an amazingly compact way to do this in Python, and we can build them straight from each table's schema so that every column becomes a field.

In [6]:
def class_from_schema(cursor, table_name, class_name):
    s = schema(cursor, table_name)
    return namedtuple(class_name, s)

Message = class_from_schema(c, 'message', 'Message')
Handle = class_from_schema(c, 'handle', 'Handle')
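
One caveat worth knowing: namedtuple field names must be valid Python identifiers. The column names in these two tables happen to qualify, but a table with a column named after a keyword or starting with an underscore would make `namedtuple` raise a `ValueError`. The sketch below, using made-up column lists, shows the standard `rename=True` option as one possible way around that.

```python
from collections import namedtuple

# Ordinary column names map straight onto fields
Row = namedtuple('Row', ['ROWID', 'text', 'handle_id'])
row = Row(1, 'hello', 7)
print(row.text, row.handle_id)

# rename=True replaces invalid names (keywords, leading underscores)
# with positional names instead of raising ValueError
Safe = namedtuple('Safe', ['ROWID', 'class', '_hidden'], rename=True)
print(Safe._fields)  # ('ROWID', '_1', '_2')
```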

Now, we can grab all the messages, and all the handles from the database:

In [7]:
def get_all(cursor, table_name, Class):
    all_results = cursor.execute("SELECT * FROM {}".format(table_name))
    return [Class(*result) for result in all_results]

MESSAGES = get_all(c, 'message', Message)
HANDLES = get_all(c, 'handle', Handle)

So, we have all the messages, and all the handles. Using the handles, we can create a lookup keyed on the `handle_id` field of the messages, and use it to look up a contact's identification:

In [8]:
HANDLE_TO_CONTACT = {handle.ROWID: handle.id for handle in HANDLES}

def lookup_contact(message, handle_to_contact):
    if message.handle_id in handle_to_contact:
        return handle_to_contact[message.handle_id]
    elif message.handle_id == 0:
        # iMessage uses a handle id of 0 to indicate a group chat
        return "group"
    else:
        # for some reason this handle is missing
        return message.handle_id

iMessage uses the `is_from_me` field to indicate whether a message was sent by you or by the person you are speaking with. Let's use this to define a sender/receiver function:

In [None]:
def sender_receiver(message, handle_to_contact):
    contact = lookup_contact(message, handle_to_contact)
    if message.is_from_me:
        return "self", contact
    return contact, "self"

print(sender_receiver(MESSAGES[3], HANDLE_TO_CONTACT))
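
Since the cell above depends on a real message database, here is a self-contained sketch of the same logic on made-up data, covering each branch. `ToyMessage`, `toy_contacts`, and the `toy_` functions are hypothetical stand-ins that mirror `lookup_contact` and `sender_receiver` in compact form.

```python
from collections import namedtuple

# Hypothetical stand-in for Message with only the fields used here
ToyMessage = namedtuple('ToyMessage', ['handle_id', 'is_from_me'])
toy_contacts = {1: '+15550100'}

def toy_lookup(message, handle_to_contact):
    # same three branches as lookup_contact, written compactly
    fallback = "group" if message.handle_id == 0 else message.handle_id
    return handle_to_contact.get(message.handle_id, fallback)

def toy_sender_receiver(message, handle_to_contact):
    contact = toy_lookup(message, handle_to_contact)
    return ("self", contact) if message.is_from_me else (contact, "self")

print(toy_sender_receiver(ToyMessage(1, 0), toy_contacts))  # received from a known contact
print(toy_sender_receiver(ToyMessage(1, 1), toy_contacts))  # sent to a known contact
print(toy_sender_receiver(ToyMessage(0, 1), toy_contacts))  # sent with handle id 0
print(toy_sender_receiver(ToyMessage(9, 0), toy_contacts))  # handle missing from the lookup
```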

Timestamps always seem to be a little quirky to deal with. In this case, Apple's epoch starts on January 1, 2001, rather than the Unix epoch of January 1, 1970, so we add the 978307200 seconds between the two to each message timestamp:

In [10]:
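def timestamp(message, time_offset=978307200):
    # message.date counts seconds from Apple's epoch; shift to the Unix epoch, then format in local time
    return datetime.datetime.fromtimestamp(int(message.date) + time_offset).strftime('%Y-%m-%d %H:%M:%S')

print(timestamp(MESSAGES[0]))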

2014-08-26 12:05:52
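
The offset is not a magic number: it is just the number of seconds between the two epochs, which the sketch below derives. It also includes a hypothetical helper, `apple_to_datetime`, for databases where the `date` column holds nanoseconds instead of seconds, as newer macOS releases are reported to do. The threshold used to tell the two apart is an assumption, and unlike `timestamp` above, the helper returns UTC rather than local time.

```python
import datetime

UNIX_EPOCH = datetime.datetime(1970, 1, 1)
APPLE_EPOCH = datetime.datetime(2001, 1, 1)

# Seconds between the two epochs; this is the time_offset default used above
offset = int((APPLE_EPOCH - UNIX_EPOCH).total_seconds())
print(offset)  # 978307200

def apple_to_datetime(raw, nanosecond_threshold=10**12):
    """Convert a raw message date to a naive UTC datetime.

    Assumption: values above the threshold are nanoseconds, not seconds.
    """
    seconds = raw / 1e9 if raw > nanosecond_threshold else raw
    return APPLE_EPOCH + datetime.timedelta(seconds=seconds)

print(apple_to_datetime(0))  # 2001-01-01 00:00:00
```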


## Feeding the messages to Elasticsearch

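We've now got all the messages, we can look up senders and receivers, and the timestamps have been cleaned up. We're ready to feed this to Elasticsearch. Python makes it very natural to get things into the proper format to post: the bulk helper expects one dictionary per document, with the target index and type as metadata and the document itself under `_source`.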

In [None]:
def index(message, handle_to_contact, _index, _type):
    sender, receiver = sender_receiver(message, handle_to_contact)
    return {
        '_index': _index,
        '_type': _type,
        '_source': {
            "text": message.text,
            "timestamp": timestamp(message),
            "sender": sender,
            "receiver": receiver
        }
    }

print(index(MESSAGES[0], HANDLE_TO_CONTACT, 'messages', 'imessage'))

All that remains now is to post them! The `bulk` helper from the Elasticsearch Python client takes care of this for us:

In [None]:
def post_all(messages, handle_to_contact, _index, _type):
    es = Elasticsearch()
    # pass a generator so the actions are built lazily as bulk consumes them
    return bulk(es, (index(message, handle_to_contact, _index, _type) for message in messages))

post_all(MESSAGES, HANDLE_TO_CONTACT, 'messages', 'imessage')
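
One thing to watch: because the actions carry no `_id`, Elasticsearch generates a fresh one for every document, so running the post a second time would index every message again. A possible refinement, sketched here on made-up data, is to reuse each message's `guid` as the document `_id` (a metadata key the bulk helper accepts). The `ToyMessage` type and the `actions` generator are hypothetical, skipping rows without text is an assumption about what is worth searching, and `_type` is left out because recent Elasticsearch versions no longer use mapping types.

```python
from collections import namedtuple

# Hypothetical stand-in for Message with only the fields used here
ToyMessage = namedtuple('ToyMessage', ['guid', 'text'])

def actions(messages, index_name):
    """Yield one bulk action per message, keyed by the message guid."""
    for message in messages:
        if message.text is None:
            # assumption: rows without text are not worth searching
            continue
        yield {
            '_index': index_name,
            '_id': message.guid,  # same guid -> same document on re-runs
            '_source': {'text': message.text},
        }

toy_messages = [ToyMessage('A-1', 'hello'), ToyMessage('A-2', None), ToyMessage('A-3', 'lunch?')]
for action in actions(toy_messages, 'messages'):
    print(action['_id'], action['_source']['text'])
```

A generator like this could be handed to `bulk` in place of the generator expression used in `post_all`.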

## Searching messages

We are now rocking and rolling: we can query as we please, and it is WAY faster than the built-in search.

In [None]:
def search(term, index, field):
    es = Elasticsearch()
    res = es.search(index=index, body={"query": {"match": {field: term}}})
    print("Got %d Hits" % res['hits']['total'])
    for hit in res['hits']['hits']:
        # the indexed documents store the message under "text"
        print("%(timestamp)s: %(text)s" % hit["_source"])
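
Two details are worth a sketch if the goal is to see all results in one view. Elasticsearch returns only 10 hits per search by default, so the request body needs an explicit `size`, and from Elasticsearch 7 onward `hits.total` is an object rather than a plain number. The helper names `build_query` and `hit_count` below are hypothetical, and the default `size` of 100 is an arbitrary choice for illustration.

```python
def build_query(term, field, size=100):
    """Build a match query that returns up to `size` hits instead of the default 10."""
    return {
        "size": size,
        "query": {"match": {field: term}},
    }

def hit_count(response):
    """Read the total hit count from either response shape."""
    total = response['hits']['total']
    # Elasticsearch 7+ returns {"value": n, "relation": ...}; older versions return n
    return total['value'] if isinstance(total, dict) else total

print(build_query('lunch', 'text'))
print(hit_count({'hits': {'total': 3}}))
print(hit_count({'hits': {'total': {'value': 3, 'relation': 'eq'}}}))
```

The dictionary from `build_query` could be passed as the `body` argument in `search` above, with `hit_count(res)` replacing the direct `res['hits']['total']` lookup.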