# Preprocessing the Enron Dataset
*Convert the current CSV file to a JSON file with a structure ready for visualisation.*

Each line in the CSV file is one (message, recipient) pair, on the assumption that no message lists the same recipient twice. I want a hierarchical structure:
```
array of threads: [thread_1, ..., thread_n], each thread has threadId and an
array of messages: [message_1, ..., message_m], each message has messageId, subject, sender, time, and an
array of recipients: [rec_1, ..., rec_k], each recipient has email and type

[
    { 
        threadId: e94a22508dac953,
        messages: [
            {
                messageId: e94a22508dac953,
                subject: FW: LINE SM-123,
                sender: victor.lamadrid@enron.com,
                time: 2001-10-01T14:19:03-07:00,
                recipients: [
                    { email: john.hodge@enron.com, type: to },
                    { email: john.singer@enron.com, type: cc },
                    { email: scott.neal@enron.com, type: bc },
                    { email: clarissa.garcia@enron.com, type: to }
                ]
            }
        ]
    }
]
```

In [1]:
input_file = 'enronThread2001.csv'
output_file = 'enronThread2001.json'

In [2]:
import pandas as pd
import pprint
import json

In [3]:
df = pd.read_csv(input_file)
df.head()

Unnamed: 0,TID,MID,SUBJECT,FROM,TIMESTAMP,TO,TYPE
0,e94a22508dac953,e94a22508dac953,"""FW: LINE SM-123""",victor.lamadrid@enron.com,2001-10-01T14:19:03-07:00,john.hodge@enron.com,TO
1,e94a22508dac953,e94a22508dac953,"""FW: LINE SM-123""",victor.lamadrid@enron.com,2001-10-01T14:19:03-07:00,john.singer@enron.com,TO
2,e94a22508dac953,e94a22508dac953,"""FW: LINE SM-123""",victor.lamadrid@enron.com,2001-10-01T14:19:03-07:00,scott.neal@enron.com,TO
3,e94a22508dac953,e94a22508dac953,"""FW: LINE SM-123""",victor.lamadrid@enron.com,2001-10-01T14:19:03-07:00,clarissa.garcia@enron.com,TO
4,e94a22508dac953,e94a22508dac953,"""FW: LINE SM-123""",victor.lamadrid@enron.com,2001-10-01T14:19:03-07:00,chris.germany@enron.com,TO
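
The structure relies on each (message, recipient) pair appearing only once. A minimal sketch of how that assumption could be checked is shown here on a toy frame; the rows and addresses are made up for illustration, and only the column names `MID`, `TO` and `TYPE` come from the CSV.

```python
import pandas as pd

# Toy rows with the same column names as the CSV; the values are invented.
toy = pd.DataFrame({
    'TID': ['t1', 't1', 't1'],
    'MID': ['m1', 'm1', 'm1'],
    'TO': ['a@example.com', 'b@example.com', 'a@example.com'],
    'TYPE': ['TO', 'TO', 'CC'],
})

# keep=False marks every row that shares its (MID, TO) pair with another row.
dupes = toy[toy.duplicated(subset=['MID', 'TO'], keep=False)]
print(len(dupes))

# One possible policy: keep the first occurrence of each (MID, TO) pair.
deduped = toy.drop_duplicates(subset=['MID', 'TO'])
print(len(deduped))
```

Running the same `duplicated` call on `df` would show whether the real data needs deduplicating before the conversion. Which copy to keep when the types differ (for example `TO` versus `CC`) is a judgment call that the source does not settle.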


In [4]:
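def make_thread(tid, df):
    """Return a dict for one thread: its id and the list of its messages."""
    # Map each message id in this thread to the row labels of its recipient rows.
    message_groups = df[df['TID'] == tid].groupby('MID').groups
    return {
        'threadId': tid,
        'messages': [make_message(mid, indices, df) for mid, indices in message_groups.items()]
    }

def make_message(mid, indices, df):
    """Return a dict for one message; `indices` are the row labels of its recipients."""
    first = indices[0]  # subject, sender and time repeat on every row of a message
    return {
        'messageId': mid.strip(),
        'subject': df.loc[first, 'SUBJECT'].strip().replace('"', ''),
        'sender': df.loc[first, 'FROM'].strip(),
        'time': df.loc[first, 'TIMESTAMP'].strip(),
        'recipients': [make_recipient(idx, df) for idx in indices]
    }

def make_recipient(index, df):
    """Return a dict for one recipient row: email address and recipient type."""
    return {
        'email': df.loc[index, 'TO'],
        'type': df.loc[index, 'TYPE']
    }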
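
`make_thread` filters the whole frame once per thread, which can get slow when there are many threads. The sketch below is one possible alternative, not the approach used in this notebook: it walks nested `groupby` objects so each row is visited once. The function name `build_threads` and the toy rows are hypothetical. Note that `sort=False` keeps threads and messages in order of first appearance, whereas iterating `.groups` as above follows the sorted keys.

```python
import pandas as pd

def build_threads(df):
    """Group rows by thread, then by message, without re-filtering df per thread."""
    threads = []
    # sort=False keeps groups in order of first appearance in the file.
    for tid, thread_rows in df.groupby('TID', sort=False):
        messages = []
        for mid, rows in thread_rows.groupby('MID', sort=False):
            first = rows.iloc[0]  # message-level fields repeat on every row
            messages.append({
                'messageId': mid.strip(),
                'subject': first['SUBJECT'].strip().replace('"', ''),
                'sender': first['FROM'].strip(),
                'time': first['TIMESTAMP'].strip(),
                'recipients': [
                    {'email': to, 'type': typ}
                    for to, typ in zip(rows['TO'], rows['TYPE'])
                ],
            })
        threads.append({'threadId': tid, 'messages': messages})
    return threads

# Toy rows with the CSV's column names; the values are invented.
toy = pd.DataFrame({
    'TID': ['t1', 't1', 't2'],
    'MID': ['m1', 'm1', 'm2'],
    'SUBJECT': ['"Hello"', '"Hello"', 'Re: Hello'],
    'FROM': ['s@example.com', 's@example.com', 'a@example.com'],
    'TIMESTAMP': ['2001-10-01T14:19:03-07:00'] * 3,
    'TO': ['a@example.com', 'b@example.com', 's@example.com'],
    'TYPE': ['TO', 'TO', 'TO'],
})
toy_threads = build_threads(toy)
print(len(toy_threads))
```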

In [5]:
threads = [make_thread(tid, df) for tid in df.groupby('TID').groups.keys()]
with open(output_file, 'w') as f:
    json.dump(threads, f)

In [12]:
# Keep the ten threads with the most messages.
threads_10 = sorted(threads, key=lambda x: len(x['messages']), reverse=True)[:10]
with open('threads-10.json', 'w') as f:
    json.dump(threads_10, f)
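
A quick sanity check on a written file is to reload it and count the nested items; the total number of recipients should equal the number of CSV rows that went in. The following is a minimal sketch on a toy thread list written to a temporary file; the data and the file name `toy-threads.json` are made up for illustration.

```python
import json
import os
import tempfile

# Toy thread list in the same shape as `threads`; the values are invented.
toy_threads = [
    {'threadId': 't1', 'messages': [
        {'messageId': 'm1', 'recipients': [
            {'email': 'a@example.com', 'type': 'TO'},
            {'email': 'b@example.com', 'type': 'TO'},
        ]},
        {'messageId': 'm2', 'recipients': [
            {'email': 'a@example.com', 'type': 'TO'},
        ]},
    ]},
    {'threadId': 't2', 'messages': [
        {'messageId': 'm3', 'recipients': [
            {'email': 'c@example.com', 'type': 'TO'},
        ]},
    ]},
]

# Write to a temporary location so the real output files are not touched.
path = os.path.join(tempfile.mkdtemp(), 'toy-threads.json')
with open(path, 'w') as f:
    json.dump(toy_threads, f)

with open(path) as f:
    loaded = json.load(f)

# Count threads, messages and recipients in the reloaded structure.
n_messages = sum(len(t['messages']) for t in loaded)
n_recipients = sum(len(m['recipients']) for t in loaded for m in t['messages'])
print(len(loaded), n_messages, n_recipients)
```

For the full output, `n_recipients` could be compared against `len(df)`.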