# Enron E-mail dataset

This notebook analyses the Enron email dataset, downloaded from: <br>
https://www.cs.cmu.edu/~./enron/
<br>
The dataset contains emails from Enron employees released as part of a US government investigation.
<br><br>
This notebook covers the following:
- Carry out an initial analysis of the dataset.
- Clean the dataset and remove any redundant data.
- Gain statistical insights from the data.


The insights section will focus on:
- basic analysis of the email metadata (who sent what and when)
- clustering email text: can we use basic K-means clustering to identify groups of emails?
    - this can form the basis for labelling data for further analysis
- topic analysis: can we extract topics for different groups of emails?
    - this can be used to analyse what users are sending

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import glob
import os
import time
from pathlib import Path

# import external packages
import sys
cwd = Path(os.getcwd())
sys.path.append(str(cwd.parent))
from data_utils.data_extraction import (
    get_metadata,
    get_email_body,
    strip_string,
    stem_text,
    remove_named_entities,
    extract_message_nouns
)
from data_utils.model_fitting import fit_tf_idf

In [None]:
# path to the dataset (set to correct path for the user)
data_path = Path("/Users/matthew/tmp/feedstock/maildir")

## Initial analysis
The aim of this section is to understand the dataset (e.g. what it contains, how the data is stored and formatted).

### Data storage
The dataset contains emails from 150 users. The username is used as a directory in the first layer of data storage.

In [None]:
users = [os.path.basename(u) for u in glob.glob(str(data_path / "*"))]
print(f"Number of users: {len(users)}")
print("")
print("10 random users:")
print("-" * len("10 random users:"))
# sample without replacement, so the same user is not listed twice
print("\n".join(sorted(np.random.choice(users, 10, replace=False))))

The emails are stored in the following format:<br><br>
*{user_name}/{email_folder}/{email_subfolder}/{email_name}*
<br><br>
- __email_folder__: the main folder the email is stored in, e.g. __inbox__, __sent__ etc.
- __email_subfolder__: there may be no sub-folder or more than 1.
- __email_name__: this is a number with a full-stop at the end, giving a count of emails in each folder/sub-folder (e.g. 1., 2., 3. etc.)
<br><br>

Below we store the location of each email in a dataframe, using the following columns. Using this, we can read in any email by reconstructing the path to it for a given index value of the dataframe.
- __user__: user name, taken from the top-level directory the email is stored in.
- __path__: path to the email from the user directory, which will be {email_folder}/{email_subfolder}
- __fname__: file name for the email, e.g. 1., 2., 3. etc.
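
As a minimal sketch of that reconstruction, the snippet below joins the three columns back into a full path. The helper name `email_path`, the `/tmp/maildir` root and the example rows are illustrative assumptions, not part of the notebook.

```python
from pathlib import Path

import pandas as pd


def email_path(data_path, row):
    """Rebuild the full path to an email from one row of the location dataframe."""
    # `path` is an empty string when the email sits directly in the user
    # directory; pathlib ignores empty segments, so the join still works
    return Path(data_path) / row["user"] / row["path"] / row["fname"]


# hypothetical rows, using the same three columns as the location dataframe
example = pd.DataFrame(
    {
        "user": ["user-a", "user-a"],
        "path": ["inbox", "sent/old"],
        "fname": ["1.", "12."],
    }
)
full_paths = [email_path("/tmp/maildir", row) for _, row in example.iterrows()]
print(full_paths[1])
```

Using `pathlib` for the join avoids hard-coding the `/` separator.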

In [None]:
df = pd.DataFrame(columns=["user", "path", "fname"])

for user in users:
    # email files end in .
    files = glob.glob(str(Path(data_path) / user / "**" / "*."), recursive=True)
    # find the filename
    fnames = [os.path.basename(f) for f in files]
    # find the path from data_path
    paths = [os.path.dirname(f) for f in files]
    paths = [p.split(f"{data_path}/{user}/")[-1] if p != f"{data_path}/{user}" else "" for p in paths]
    df_sub = pd.DataFrame(
        columns=["user", "path", "fname"],
        data=zip([user] * len(files), paths, fnames)
    )
    df = pd.concat([df, df_sub], axis=0)
df = df.sort_values(["user", "path", "fname"]).reset_index(drop=True)
display(df)

### Overview of email senders
The number of emails sent/received varies widely across the 150 users.<br><br>

In [None]:
print(f"Total emails in dataset: {len(df)}")
print("")
print("-" * 50)
print("Emails per user (absolute values):")
vc = df["user"].value_counts()
display(vc.describe())
print(f"IQR:               {round(np.quantile(vc.values, 0.75) - np.quantile(vc.values, 0.25), 2)}")
print(f"Middle 95% range:  {round(np.quantile(vc.values, 0.975) - np.quantile(vc.values, 0.025), 2)}")
print("")

print("-" * 50)
print("Emails per user (percentage values):")
vc = df["user"].value_counts() * 100.0 / len(df)
display(vc.describe())

print("-" * 50)
print("Users responsible for most emails (as a %):")
display(vc.head())

In [None]:
plt.figure(figsize=(12, 4))
plt.bar(range(len(vc)), vc.values)
plt.title("Percentage of all emails sent+received by each user", fontsize=12)
plt.ylabel("% of emails", fontsize=12)
plt.xlabel("User rank", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

### Email storage
- Most emails are stored in default folders (e.g. all_documents, inbox, sent).
- Many folders have a single email in them.
- Some non-default folder names have many emails e.g. bill_williams_iii.

- all_documents needs to be checked to confirm that it does not duplicate other folders.
- Some folders can be identified as sent or received, others are not so clear.
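
One possible way to flag "sent" folders from their names alone is sketched below. The function name `is_sent_folder` and the example folder names are assumptions for illustration; the rule is a heuristic and will misclassify some folders.

```python
def is_sent_folder(folder):
    """Heuristic: decide from the folder name alone whether it holds sent emails."""
    name = folder.lower()
    # "sent" also appears inside words such as "presentations", so exclude those
    looks_sent = "sent" in name and "present" not in name
    looks_outbox = "outb" in name
    return looks_sent or looks_outbox


# illustrative folder names
examples = ["sent", "_sent_mail", "sent_items", "outbox", "presentations", "inbox"]
flags = {folder: is_sent_folder(folder) for folder in examples}
print(flags)
```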

In [None]:
vc = df["path"].value_counts()
print(f"Number of unique folders (including sub-folders): {len(vc)}")
print("")
print("Number of emails in each folder:")
display(vc.head(20))
display(vc.tail(20))

In [None]:
# potential folders for sent emails:
# names containing "sent" (but not "present"), or names containing "outb"
sent_folders = sorted(
    [v for v in vc.index if ("sent" in v.lower() and "present" not in v.lower()) or ("outb" in v.lower())]
)
n_sent = vc[sent_folders].sum()
n_rec = vc[~vc.index.isin(sent_folders)].sum()
n_tot = vc.sum()
print(f"Number of sent emails:     {n_sent} ({round(100.0 * n_sent / n_tot)}%)")
print(f"Number of received emails: {n_rec} ({round(100.0 * n_rec / n_tot)}%)")
print("")
print("\n".join(sent_folders))

### Extracting email meta-data
The metadata for the emails is stored at the top of the email text. The format is:
<br><br>
metadata_key_1: metadata_value_1<br>
metadata_key_2: metadata_value_2<br>
...<br>
...<br>
metadata_key_N: metadata_value_N<br>
<br><br>
These are extracted from the email body and stored in a dataframe:
- columns: metadata keys
- index: email index, as in df
- values: the metadata values
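
For illustration, the same `key: value` header layout can be parsed with Python's standard `email` module. This is only a sketch on a made-up message; it is not the `get_metadata` helper used in this notebook, whose implementation is not shown here.

```python
from email import message_from_string

# a made-up email in the same layout: header lines, a blank line, then the body
raw = (
    "Message-ID: <123.example@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Test\n"
    "\n"
    "Body text starts after the first blank line.\n"
)

msg = message_from_string(raw)
# msg.items() yields (key, value) pairs in file order;
# note that dict() keeps only the last value if a key is repeated
meta = dict(msg.items())
body = msg.get_payload()
print(meta["From"], "->", meta["To"])
```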

#### Metadata columns

The metadata has the following columns (with our data for user, path and fname added).

- __Message-ID__: Unique id for the email.
- __Mime-Version__: MIME version of the message (MIME stands for "Multipurpose Internet Mail Extensions").
- __Date__: Date that the email was sent, with timezone.
- __From__: Sender of the email, not necessarily the same as the user.
- __To__: Receiver of the email.
- __Subject__: Subject line of the email.
- __Content-Type__: Encoding of the email body text.
- __Content-Transfer-Encoding__: Mechanism for re-encoding data for sending the email.
- __X-From__: Name of the sender (with email address if external)
- __X-To__: Name of the recipient, with email address in some cases (mostly external)
- __X-cc__: List of CC'd recipients.
- __X-bcc__: List of BCC'd recipients.
- __X-Folder__: Storage folder of the email.
- __X-Origin__: Origin of the email, not necessarily the same as the user, could be due to errors in storing the emails.
- __X-FileName__: Notes Storage Facility filename (used by IBM Notes to store email data).
- __Cc__: List of CC'd recipients, sometimes different from X-cc.
- __Bcc__: List of BCC'd recipients, seems to be the same as CC.
- __user__: Username, from the folder the emails are stored in.
- __path__: Path the email is stored in, from the user folder (e.g. username/path/1.)
- __fname__: Filename of the email, stored as 1., 2., 3.

In [None]:
# read in all the metadata, using get_metadata function

start_time = time.time()

contents_meta_all = {}

# extract metadata for a sample of N emails (use all or a sub-sample)
# Note: it can take ~30mins to read in all the metadata (for ~500,000 emails)
N = len(df)
if N < len(df):
    inds = sorted(np.random.choice(range(len(df)), N, replace=False))
else:
    inds = list(df.index)

for i, ind in enumerate(inds):
    fname = os.path.join(data_path, "/".join(df.loc[ind, ["user", "path", "fname"]].values))
    with open(fname, "r", errors='replace') as f:
        contents = f.readlines()
        try:
            contents_meta = get_metadata(contents)
        except ValueError:
            print(f"{i}: No meta data for {fname}")
            contents_meta = {}
        contents_meta_all[ind] = contents_meta
    # report progress roughly every 1%; max(1, ...) avoids division by zero when N < 100
    if (i + 1) % max(1, N // 100) == 0:
        print(f"complete: {round(100.0 * (i + 1) / N, 2)}% in {round(time.time() - start_time, 2)}s")
df_meta = pd.DataFrame.from_dict(contents_meta_all, orient="index")
print(time.time() - start_time)
display(df_meta.sample(n=20))
# cleanup
del contents_meta_all

Make sure no rows are duplicated

In [None]:
print(f"Number of duplicated rows: {df_meta.duplicated().sum()}")

We now check for redundant columns, that is, those where all values are the same, or that duplicate other columns.

In [None]:
# check for redundant columns

# - Message-ID we can drop, each is unique and we use the integer index as a unique ID
# - Mime-Version (Multipurpose Internet Mail Extensions) is always the same
# - X-bcc has less than 1% non-empty values
print("column name".ljust(30), "num. unique vals".ljust(25), "% unique vals")
for col in df_meta.columns:
    print(
        col.ljust(30),
        str(len(df_meta[col].unique())).ljust(25),
        round(100.0 * len(df_meta[col].unique()) / len(df_meta), 4))

print("\n")
print("Columns with few unique values:")
for col in [col for col in df_meta.columns if len(df_meta[col].unique()) < 10]:
    print("\n")
    print("-" * 50)
    print(col)
    display(df_meta[col].value_counts())
    

#### Missing data
Some columns have missing data. In some cases, the metadata key was not present for an email (Cc, Bcc), which gives a NaN. In other cases, the key was present but has no value, and it is stored here as an empty string.
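
The two kinds of missing value need to be counted separately, since `isnull()` does not flag empty strings. A small sketch on made-up data (the column values are illustrative):

```python
import numpy as np
import pandas as pd

# made-up metadata: NaN = key absent from the email, "" = key present but empty
toy = pd.DataFrame(
    {
        "To": ["bob@example.com", "", np.nan],
        "Cc": [np.nan, "carol@example.com", np.nan],
    }
)

n_absent = toy.isnull().sum()   # key missing from the email entirely
n_empty = (toy == "").sum()     # key present, value empty
print(pd.DataFrame({"absent": n_absent, "empty": n_empty}))
```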

In [None]:
# To is missing for some emails
# Cc and Bcc have missing values, and seem to be the same overall
display(df_meta.loc[:, df_meta.isnull().sum(axis=0) != 0].isnull().sum(axis=0))

# Bcc and Cc both missing is the same as when either is missing
print(f"Num where Bcc and Cc are both NaN:    {df_meta[['Cc', 'Bcc']].isnull().all(axis=1).sum()}")
# Bcc and Cc not missing, have the same value 100% of the time
inds = df_meta[['Cc', 'Bcc']].notnull().all(axis=1)
print(f"% of entries where Bcc = Cc != NaN:   {100.0 * (df_meta.loc[inds, 'Cc'] == df_meta.loc[inds, 'Bcc']).mean()}%")

# This means that we can drop Bcc, since it is the same as Cc

In [None]:
# drop columns we don't need
# - Bcc is the same as Cc
# - Message-ID is not needed, we use the dataframe index as the id
# - Mime-Version is the same for all emails
df_meta = df_meta.drop(["Bcc", "Message-ID", "Mime-Version"], axis=1)

In [None]:
# merge with the user/path info
df_meta = df_meta.merge(df, left_index=True, right_index=True, how="left")
del df

In [None]:
# save the data (set to True)
if False:
    df_meta.to_csv(str(data_path.parent / "meta.csv"))

### Email time / date
The time and date that the email was sent, including the timezone, is stored in the metadata.
<br>
No values are missing. We can parse these, and explore when emails are sent/received.
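
As a sketch of what the parsing involves, the snippet below handles a single date string with the standard library. The example string is made up but follows the usual email date layout; the notebook itself uses `pd.to_datetime` on the whole column.

```python
import re
from datetime import timezone
from email.utils import parsedate_to_datetime

raw_date = "Mon, 14 May 2001 16:39:00 -0700 (PDT)"

# the numeric offset (-0700) determines the actual instant in time
local_dt = parsedate_to_datetime(raw_date)
utc_dt = local_dt.astimezone(timezone.utc)

# the label in parentheses is only a human-readable timezone name
match = re.search(r"\(([^)]+)\)", raw_date)
tz_label = match.group(1) if match else None

print(utc_dt.isoformat(), tz_label)
```

Converting everything to UTC makes timestamps comparable across users, at the cost of shifting the apparent local hour of day.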

In [None]:
# timezones (probably just tell us time of year)
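# keep only the text inside the parentheses; regex=True is required in recent pandas versions
df_meta["Date"].str.replace(r'[^(]*\(|\)[^)]*', '', regex=True).value_counts()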

In [None]:
# convert to a datetime, using UTC (since the dates are timezone aware)
df_meta["datetime_utc"] = pd.to_datetime(df_meta["Date"], utc=True)

In [None]:
# First look at the year:

# Most data is from 2000 - 2001
# 1999 and 2002 also have data

# Some are clearly incorrect (e.g. 1980, 2043)

print("value counts for years")
display(df_meta["datetime_utc"].dt.year.value_counts())

In [None]:
# 1997 and 1998 could be actual emails

# 1997: most emails are from 1 user 
display(df_meta.loc[df_meta["datetime_utc"].dt.year == 1997, "user"].value_counts())
# 1998: most emails are from 1 user 
display(df_meta.loc[df_meta["datetime_utc"].dt.year == 1998, "user"].value_counts())

In [None]:
# some have a date of 1980: all same time, must be an error / placeholder value
df_meta[df_meta["datetime_utc"].dt.year == 1980]

In [None]:
# look at years that do not occur a lot: these seem to be spam (from non-Enron email addresses)
# also all are outlier years from the data
# probably have data for 1997 to 2002

vc = df_meta["datetime_utc"].dt.year.value_counts()
for year in vc[vc.values < 100].index:
    print("")
    print("-" * 50)
    print(year)
    data = df_meta[df_meta["datetime_utc"].dt.year == year]
    display(data["From"].value_counts())

In [None]:
# have to remove 1980s and the other years where the counts are low for any analysis on years
years = list(vc[vc.values < 100].index) + [1980]


In [None]:
# distribution of emails by time
plt.figure(figsize=(18, 5))
plt.hist(df_meta.loc[~df_meta["datetime_utc"].dt.year.isin(years), "datetime_utc"], bins=100)
plt.title("distribution of number of emails over time", fontsize=14)
plt.xlabel("date", fontsize=14)
plt.ylabel("count(emails)", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

In [None]:
# next look at basic date

# Large variation, some days with several 1000, some with 1

print("Value counts for dates:")
vc = df_meta.loc[~df_meta["datetime_utc"].dt.year.isin(years), "datetime_utc"].dt.date.value_counts()
display(vc)

plt.figure(figsize=(18, 5))
plt.hist(df_meta.loc[~df_meta["datetime_utc"].dt.year.isin(years), "datetime_utc"].dt.date, bins=500, density=True)
plt.title("Distribution of dates", fontsize=14)
plt.xlabel("date", fontsize=14)
plt.ylabel("P(date)", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

In [None]:
# day of the month is fairly constant (fewer months have 31 days)
vc = df_meta.loc[~df_meta["datetime_utc"].dt.year.isin(years), "datetime_utc"].dt.day.value_counts()

plt.figure(figsize=(18, 5))
plt.bar(vc.index, vc.values)
plt.title("Count of day of month", fontsize=14)
plt.xlabel("day of month", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

In [None]:
# Weekends have fewer emails sent and received
# Mid-week has most

vc = df_meta.loc[~df_meta["datetime_utc"].dt.year.isin(years), "datetime_utc"].dt.day_name().value_counts()
plt.figure(figsize=(18, 5))
plt.barh(vc.index, vc.values)
plt.title("Count of day of week", fontsize=14)
plt.ylabel("day of week", fontsize=14)
plt.xlabel("Count", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

In [None]:
# The summer months produce fewer emails
# Autumn produces the most (catch up after summer?)

vc = df_meta.loc[~df_meta["datetime_utc"].dt.year.isin(years), "datetime_utc"].dt.month_name().value_counts()

plt.figure(figsize=(18, 5))
plt.barh(vc.index, vc.values)
plt.title("Count of month", fontsize=14)
plt.ylabel("month", fontsize=14)
plt.xlabel("Count", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

In [None]:
# email rates: how many emails each user sends/receives per day

tmp = (
    df_meta.loc[~df_meta["datetime_utc"].dt.year.isin(years), :]
    .groupby("user")["datetime_utc"].agg(["count", "min", "max"])
)
tmp["range"] = (tmp["max"] - tmp["min"]).dt.days
tmp["email_rate"] = tmp["count"] / tmp["range"]

In [None]:
display(tmp.sort_values(["email_rate", "count", "range"]))

In [None]:
# "linder-e" has the highest email rate: 2805 in 49 days. Some seem to be repeated
display(df_meta.loc[df_meta["user"] == "linder-e", "Subject"].value_counts())
print("-" * 80)
# "symes-k" is similar
display(df_meta.loc[df_meta["user"] == "symes-k", "Subject"].value_counts())

In [None]:
# also look at:
# words per day
# chars per day
# corr between words per email and num emails
# corr between words per email and avg. num recipients (to + cc + bcc, as a set)
display(tmp["email_rate"].agg(["min", "median", "max", "mean", "std"]))
plt.figure(figsize=(18, 5))
plt.hist(tmp["email_rate"], bins=100)
plt.title("distribution of email rates (emails per day)", fontsize=14)
plt.xlabel("email rate", fontsize=14)
plt.ylabel("count(email rate)", fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

### From and To
The From and To metadata keys show who sent and who received an email.


In [None]:
# Sometimes From == TO
tmp = df_meta[(df_meta["From"] == df_meta["To"])]
print(f"Number where From == To: {len(tmp)}, {round(100.0 * len(tmp) / len(df_meta), 2)}%")
display(tmp)

In [None]:
# often, but not always, these have CC's
vc = tmp["X-cc"].value_counts()
# trim the Ccs to 80 chars, easier to view
vc.index = [i[:80] for i in vc.index]
display(vc)

In [None]:
# Some Tos are missing
inds = df_meta["To"].isnull()
print(f"Number of missing \"To\" entries:            {inds.sum()}, {round(100.0 * inds.mean(), 2)}%")
# Count the emails where both To and Cc are missing (the rest have a Cc even with no To)
inds = df_meta[["To", "Cc"]].isnull().all(axis=1)
print(f"Number of missing \"To\" and \"Cc\" entries:   {inds.sum()}, {round(100.0 * inds.mean(), 2)}%")
#display(df_meta.loc[inds, "Cc"].str.split(", ").apply(len).value_counts())


### Subject
The email subject can be examined in two ways:
- length of the subject (e.g. number of words, number of chars)
- the actual subject (e.g. what terms are used) 
<br><br>
A full analysis of the data could be used to find correlations between e.g. subject length and the time to get a response.

In [None]:
# The length in chars and words
subject_char_lens = df_meta["Subject"].fillna('').apply(len)
subject_word_lens = df_meta["Subject"].fillna('').str.split().apply(len)

plt.figure(figsize=(18, 5))

plt.subplot(1, 2, 1)
plt.hist(subject_char_lens, bins=100)
plt.title("Distribution of subject lengths (chars)", fontsize=12)
plt.xlabel("len", fontsize=12)
plt.ylabel("P(len)", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

plt.subplot(1, 2, 2)
plt.hist(subject_word_lens, bins=100)
plt.title("Distribution of subject lengths (words)", fontsize=12)
plt.xlabel("len", fontsize=12)
plt.ylabel("P(len)", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Note: User, X-Origin and From are not the same thing
# Below shows, for each user, the number of unique X-Origin values, with some having 2+
df_meta.groupby("user")["X-Origin"].agg(["count", "nunique"])

In [None]:
print(f"Unique X-Origin: {len(df_meta['X-Origin'].unique())}")
print(f"Unique X-From:   {len(df_meta['X-From'].unique())}")
print(f"Unique From:     {len(df_meta['From'].unique())}")
print(f"Unique user:     {len(df_meta['user'].unique())}")

# loop over each user, find those with more than 1 X-Origin value
# Some appear to be misspellings of the same name, others are mixed up
# Would need to check that the data is stored in the correct folder

for t in df_meta["user"].unique():
    vc = df_meta.loc[df_meta["user"] == t, "X-Origin"].str.lower().value_counts()
    if len(vc) > 1:
        print("\n")
        print("-" * 80)
        print(f"User: {t}")
        display(vc.to_frame())


### Message bodies
The body of the message is read in as follows:
- take the lines in the email file below the final metadata line
- strip any html in the email
- merge the lines into a single string
- replace all whitespace with a single space
<br><br>
The messages may contain additional text, e.g. the message they are replying to (Original Message), or the message being forwarded (Forwarded Message). For now, we leave these in the messages.
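
The reading steps listed above can be sketched as follows. This is a hypothetical stand-in, not the `get_email_body` helper used below: the function name `extract_body`, the blank-line rule for locating the end of the metadata and the regex-based HTML removal are all assumptions.

```python
import re


def extract_body(lines):
    """Hypothetical sketch: return the cleaned body text from the lines of an email file."""
    # assumption: the metadata block ends at the first blank line
    blank = [i for i, line in enumerate(lines) if line.strip() == ""]
    if not blank:
        raise ValueError("no blank line separating metadata from body")
    # merge the remaining lines into a single string
    body = " ".join(lines[blank[0] + 1:])
    # crude HTML removal: drop anything that looks like a tag
    body = re.sub(r"<[^>]+>", " ", body)
    # replace every run of whitespace with a single space
    return re.sub(r"\s+", " ", body).strip()


# made-up email lines, as returned by f.readlines()
example_lines = [
    "From: alice@example.com\n",
    "X-FileName: alice.nsf\n",
    "\n",
    "Hi <b>Bob</b>,\n",
    "\n",
    "See   you\ttomorrow.\n",
]
clean = extract_body(example_lines)
print(clean)
```

A regex is a blunt tool for HTML; a proper HTML parser would be more robust for emails with embedded markup.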

In [None]:
start_time = time.time()

msgs_all = {}

# extract message bodies for a sample of N emails (use all or a sub-sample)
# Note: it can take ~50mins to read in all the messages (for ~500,000 emails)
N = len(df_meta)
if N < len(df_meta):
    inds = sorted(np.random.choice(range(len(df_meta)), N, replace=False))
else:
    inds = list(df_meta.index)

for i, ind in enumerate(inds):
    fname = os.path.join(data_path, "/".join(df_meta.loc[ind, ["user", "path", "fname"]].values))
    with open(fname, "r", errors='replace') as f:
        contents = f.readlines()
        try:
            msg = get_email_body(contents)
        except ValueError:
            print(f"{i}: No email body for {fname}")
            msg = ""
        msgs_all[ind] = msg
    # report progress roughly every 1%; max(1, ...) avoids division by zero when N < 100
    if (i + 1) % max(1, N // 100) == 0:
        print(f"complete: {round(100.0 * (i + 1) / N, 2)}% in {round(time.time() - start_time, 2)}s")
df_msgs = pd.DataFrame(msgs_all.values(), index=msgs_all.keys(), columns=["msg"])
print(time.time() - start_time)
display(df_msgs.sample(n=20))


In [None]:
if False:
    # save the messages (set to True to do so)
    df_msgs.to_csv(str(data_path.parent / "msgs.csv"))
if False:
    # read in the messages (set to True to do so)
    df_msgs = pd.read_csv(str(data_path.parent / "msgs.csv"), index_col=0)
    df_meta = pd.read_csv(str(data_path.parent / "meta.csv"), index_col=0)
    print(df_meta.shape, df_msgs.shape)

In [None]:
# remove forwarded and original messages
if False:
    df_msgs["msg"] = df_msgs["msg"].str.split("----- Forwarded").str[0]
    df_msgs["msg"] = df_msgs["msg"].str.split("-----Original").str[0]

In [None]:
# remove missing messages (only had html)
inds_keep = df_msgs[df_msgs["msg"].notnull()].index
df_msgs = df_msgs.loc[inds_keep, :]
df_meta = df_meta.loc[inds_keep, :]
print(df_meta.shape, df_msgs.shape)

In [None]:
# Some messages are duplicated (about 50%)
# These can be removed for the following analysis
print(f"Duplicated messages: {round(100.0 * df_msgs.duplicated().mean(), 2)}%")
df_meta = df_meta.loc[~df_msgs.duplicated(), :]
df_msgs = df_msgs.loc[~df_msgs.duplicated(), :]
print(df_meta.shape, df_msgs.shape)

In [None]:
# data now read in and stored

### Text clustering
Below we implement an algorithm to cluster the data:
- Split the data into train, validation and test sets
- Fit a TF-IDF vectorizer to the train set
- Extract TF-IDF feature vectors for the train, val and test sets
- Fit a K-Means model to the train set, using the val set to find the optimal number of centers (using the Silhouette Score)
<br><br>

The Silhouette Score goes from -1 (bad fit) to +1 (good fit) and is the mean Silhouette Coefficient of all samples, which is given by `(b - a) / max(a, b)`, where
- a: the mean intra-cluster distance for each sample.
- b: the mean nearest-cluster distance for each sample. 
<br><br>
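
To make the definition concrete, the sketch below computes the Silhouette Score by hand on a tiny made-up dataset and compares it with scikit-learn's `silhouette_score`. The helper name `silhouette_manual` and the data are illustrative.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# four 1-D points forming two obvious clusters
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])


def silhouette_manual(X, labels):
    """Mean of (b - a) / max(a, b) over all samples (Euclidean distance)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        a = dists[i, same].mean()  # mean intra-cluster distance
        # mean distance to the nearest other cluster
        b = min(dists[i, labels == k].mean() for k in np.unique(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


manual = silhouette_manual(X, labels)
reference = silhouette_score(X, labels)
print(manual, reference)
```

This sketch assumes every cluster has at least two members; for a single-member cluster scikit-learn sets the coefficient to 0.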

Once this model is fitted, we can use PCA to reduce the feature vectors to 2d and visualize the clusters.
<br><br>
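
The whole pipeline (TF-IDF features, K-Means clusters, PCA projection) can be sketched end to end on a toy corpus. The documents and variable names below are made up for illustration; the notebook's own cells use the `fit_tf_idf` helper and much larger samples.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up documents
docs = [
    "meeting schedule for monday",
    "reschedule the monday meeting",
    "gas price forecast for the market",
    "market price of gas is rising",
    "schedule a meeting about the forecast",
    "power market price report",
]

# text -> sparse TF-IDF feature vectors
vec_toy = TfidfVectorizer(stop_words="english")
X_toy = vec_toy.fit_transform(docs)

# cluster the feature vectors
km_toy = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)
labels_toy = km_toy.fit_predict(X_toy)

# project both the samples and the cluster centers to 2D for plotting
pca_toy = PCA(n_components=2, random_state=0)
points_2d = pca_toy.fit_transform(X_toy.toarray())
centers_2d = pca_toy.transform(km_toy.cluster_centers_)

print(labels_toy, points_2d.shape, centers_2d.shape)
```

PCA needs a dense array here, which is why the notebook mentions `IncrementalPCA` as an option for larger datasets.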

The aim is to find patterns in the clusters, e.g.:
- can we determine types of email (e.g. personal vs business, spam emails)?
- can we find an interesting sub-set of emails to examine?

In [None]:
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, IncrementalPCA, NMF

from sklearn.metrics import silhouette_score

In [None]:
# split the data:
# fit to a random sub-sample of the dataset
# predict on a different sub-sample
inds = list(df_msgs.index)
np.random.shuffle(inds)

n_train = 50000
n_val = 10000
n_test = 10000

inds_train = inds[:n_train]
inds_val = inds[n_train : n_train + n_val]
inds_test = inds[n_train + n_val : n_train + n_val + n_test]

df_msgs_train = df_msgs.loc[inds_train, :].copy(deep=True)
df_msgs_test = df_msgs.loc[inds_test, :].copy(deep=True)
df_msgs_val = df_msgs.loc[inds_val, :].copy(deep=True)
del df_msgs

df_meta_train = df_meta.loc[inds_train, :].copy(deep=True)
df_meta_test = df_meta.loc[inds_test, :].copy(deep=True)
df_meta_val = df_meta.loc[inds_val, :].copy(deep=True)
del df_meta

In [None]:
# strip and stem the messages
df_msgs_train["msg_strip"] = strip_string(df_msgs_train["msg"].values.tolist())
df_msgs_val["msg_strip"] = strip_string(df_msgs_val["msg"].values.tolist())
df_msgs_test["msg_strip"] = strip_string(df_msgs_test["msg"].values.tolist())

# note, results are worse with stemming, so leave at this time
if False:
    df_msgs_train["msg_strip"] = stem_text(df_msgs_train["msg_strip"].values.tolist())
    df_msgs_val["msg_strip"] = stem_text(df_msgs_val["msg_strip"].values.tolist())
    df_msgs_test["msg_strip"] = stem_text(df_msgs_test["msg_strip"].values.tolist())

In [None]:
# remove named entities, replace with their type
# not used now due to speed issues.
if False:
    df_msgs_train["msg_strip"] = remove_named_entities(df_msgs_train["msg_strip"].values.tolist())
    df_msgs_val["msg_strip"] = remove_named_entities(df_msgs_val["msg_strip"].values.tolist())
    df_msgs_test["msg_strip"] = remove_named_entities(df_msgs_test["msg_strip"].values.tolist())


In [None]:
# fit the vectorizer to the train data, keep 10,000 features only due to memory constraints
vec = fit_tf_idf(
    msgs=df_msgs_train["msg_strip"].values.tolist(),
    max_features=10000,
    max_df=0.95,
    min_df=2
)

# get the features for the sample data
features_train = vec.transform(df_msgs_train["msg_strip"].values.tolist())
features_val = vec.transform(df_msgs_val["msg_strip"].values.tolist())

In [None]:
# leave the non-alpha terms / names / etc. in for now

# we could replace names (e.g. "glynn") with a placeholder (e.g. "PERSON"), although this
# operation is expensive and will be left for now

print(np.random.choice(list(vec.vocabulary_.keys()), 100, replace=False))
vocab = list(vec.vocabulary_.keys())

In [None]:
# loop through number of clusters, see which has the highest silhouette score

for n_clusters in [2, 3, 4, 5, 6, 7, 8]:
    cls = MiniBatchKMeans(n_clusters=n_clusters, random_state=0)
    cls.fit(features_train)
    # predict cluster labels for new dataset
    df_msgs_val[f"label_{n_clusters}"] = cls.predict(features_val)
    # count the validation emails assigned to each cluster
    vc = df_msgs_val[f"label_{n_clusters}"].value_counts()
    # get a sample for each label
    sample_size = min(vc.min(), int(8000 / n_clusters))
    if sample_size != 1:
        inds = []
        for i in vc.index:
            inds += list(df_msgs_val[df_msgs_val[f"label_{n_clusters}"] == i].sample(n=sample_size).index)
        inds = df_msgs_val.index.isin(inds)
        sil_score = silhouette_score(
                features_val[inds],
                labels=df_msgs_val[f"label_{n_clusters}"].values[inds]
            )
        print(n_clusters, sil_score, vc.to_dict())

- 2, 3 and 7 all have roughly similar scores (we're using a small subsample of the data, so we'd expect large errors in these values).
- The analysis will continue with 3 clusters:
    - start with a small number, and increase later on

In [None]:
# will use 3, re-fit to the data
n_clusters = 3

cls = MiniBatchKMeans(n_clusters=n_clusters, random_state=0)
cls.fit(features_train)

# predict cluster labels for new dataset
features_test = vec.transform(df_msgs_test["msg_strip"].values.tolist())
df_msgs_test["label"] = cls.predict(features_test)

# count the test emails assigned to each cluster
vc = df_msgs_test["label"].value_counts()

sample_size = min(vc.min(), int(8000 / n_clusters))
if sample_size != 1:
    inds = []
    for i in vc.index:
        inds += list(df_msgs_test[df_msgs_test["label"] == i].sample(n=sample_size).index)
    inds = df_msgs_test.index.isin(inds)
    sil_score = silhouette_score(
            features_test[inds],
            labels=df_msgs_test["label"].values[inds]
        )
    print(n_clusters, sil_score, vc.to_dict())

In [None]:
# reduce the features to 2D using PCA
pca = PCA(n_components=2, random_state=0)
# use the incremental PCA for large datasets
#pca = IncrementalPCA(n_components=2, batch_size=1000)

# fit and transform
reduced_features = pca.fit_transform(features_test.toarray())

# reduce the cluster centers to 2D
reduced_cluster_centers = pca.transform(cls.cluster_centers_)

In [None]:
# plot the results
# cluster centers are plotted as black dots
plt.figure(figsize=(8, 8))
for i in df_msgs_test["label"].value_counts().index:
    inds = df_msgs_test["label"] == i
    plt.plot(
        reduced_features[inds, 0],
        reduced_features[inds, 1],
        marker="o",
        ms=3,
        linestyle="None",
        alpha=0.5,
        label=i)
plt.plot(
    reduced_cluster_centers[:, 0],
    reduced_cluster_centers[:, 1],
    marker='o',
    ms=8,
    linestyle="None",
    color="black"
)
plt.legend()
plt.show()

- The cluster centers are reasonably well separated
- Some overlap between them
<br>

We can now examine samples from within the clusters by:
- taking the samples that are closest to their cluster center.
- taking the samples that are furthest from (0, 0) for each predicted cluster.
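
Both selections come down to ranking the members of one cluster by distance to a reference point. A minimal sketch on made-up 2D points follows; the helper name `rank_cluster_members` is an assumption and is not used by the cells below.

```python
import numpy as np


def rank_cluster_members(points, labels, cluster, reference, k=10, furthest=False):
    """Positions of the k members of `cluster` nearest to (or furthest from) `reference`."""
    member_pos = np.flatnonzero(labels == cluster)
    # L2 distance from each member of the cluster to the reference point
    dists = np.linalg.norm(points[member_pos] - reference, axis=1)
    order = np.argsort(dists)
    if furthest:
        order = order[::-1]
    return member_pos[order[:k]]


# made-up reduced features and predicted labels
points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0], [0.2, 0.0]])
labels = np.array([0, 0, 1, 1, 0])

center_0 = points[labels == 0].mean(axis=0)
closest_0 = rank_cluster_members(points, labels, 0, center_0, k=2)
furthest_1 = rank_cluster_members(points, labels, 1, np.zeros(2), k=1, furthest=True)
print(closest_0, furthest_1)
```

Restricting to cluster members before sorting avoids relying on a sort order computed for a different reference point.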

In [None]:
# take the closest 10 results to each cluster center
closest = {}
for i in df_msgs_test["label"].value_counts().index:
    inds = df_msgs_test["label"] == i
    center = reduced_cluster_centers[i, :]
    # use L2 norm, sort by that distance
    dists = np.linalg.norm(reduced_features - center, axis=1)
    inds_sort = np.argsort(dists)
    # take the 10 closest
    closest[i] = list(df_msgs_test.iloc[inds_sort].loc[inds, :].index)[:10]

In [None]:
# view the messages,
for i, inds in closest.items():
    print("\n\n")
    print("=" * 50)
    print(i)
    print("=" * 50)
    print("\n\n".join(df_msgs_test.loc[inds, "msg_strip"].str[:400]))

From a brief look at the emails, the clearest cluster is 0, which shows some kind of automated emails.
- 0: Some kind of automated email?
- 1, 2: Seem to be the same?
<br>

Looking at the furthest from (0, 0) for each case, it looks like
- 1: to do with scheduling
- 2: business emails

In [None]:
# take the furthest 10 results from (0, 0) for each predicted cluster
furthest = {}
# use L2 norm from the origin, sort by that distance (largest first)
dists = np.linalg.norm(reduced_features, axis=1)
inds_sort = np.argsort(dists)[::-1]
for i in df_msgs_test["label"].value_counts().index:
    inds = df_msgs_test["label"] == i
    # take the 10 furthest
    furthest[i] = list(df_msgs_test.iloc[inds_sort].loc[inds, :].index)[:10]

In [None]:
# view the messages
for i, inds in furthest.items():
    print("\n\n")
    print("=" * 50)
    print(i)
    print("=" * 50)
    print("\n\n".join(df_msgs_test.loc[inds, "msg_strip"].str[:500]))

In [None]:
# Where the emails came from and went to, by cluster
for i in df_msgs_test["label"].value_counts().index:
    print(i)
    # from
    # name each column explicitly, so that the merge and sort below do not
    # depend on how the pandas version names value_counts() output
    vc_from = (
        df_meta_test.loc[df_msgs_test["label"] == i, "From"].fillna("@no_from").str.split("@").str[-1].value_counts(normalize=True)
    ).rename("From").to_frame()
    # to
    vc_to = (
        df_meta_test.loc[df_msgs_test["label"] == i, "To"].fillna("@no_to").str.split("@").str[-1].value_counts(normalize=True)
    ).rename("To").to_frame()
    # Cc
    all_cc = []
    for sub_list in (
        df_meta_test.loc[df_msgs_test["label"] == i, "Cc"].fillna("@no_cc").str.split(", ").apply(lambda x: [addr.split("@")[-1] for addr in x])
    ):
        all_cc += sub_list
    vc_cc = pd.Series(all_cc).value_counts(normalize=True).to_frame()
    vc_cc.columns = ["Cc"]
    # merge them
    vc_from = vc_from.merge(vc_to, left_index=True, right_index=True, how="outer").fillna(0)
    vc_from = vc_from.merge(
        vc_cc, left_index=True, right_index=True, how="outer"
    ).fillna(0).sort_values(["Cc", "From", "To"], ascending=False)
    display(vc_from)

Cluster 0 is mostly from Enron email addresses; clusters 1 and 2 have more external emails.

### Topic analysis
In addition to the cluster analysis, we carry out a topic analysis. This uses Non-negative Matrix Factorization (NMF), where two non-negative matrices (W, H) are found whose product approximates the input feature matrix X (which is also non-negative, comprising an array of feature vectors for multiple samples).

- X: n_samples by n_features
- W: n_samples by n_components
- H: n_components by n_features

<br>
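
The shapes of the factor matrices can be checked on a small random example. This is a sketch with made-up data; the `init` and `max_iter` settings are choices for the toy example, not the settings used later in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X_toy = rng.random((6, 5))  # 6 samples, 5 non-negative features

nmf_toy = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = nmf_toy.fit_transform(X_toy)  # n_samples by n_components
H = nmf_toy.components_           # n_components by n_features

# relative error of the approximation X ~ W @ H
rel_error = np.linalg.norm(X_toy - W @ H) / np.linalg.norm(X_toy)
# the strongest component for each sample can serve as a topic label
topic_per_sample = W.argmax(axis=1)

print(W.shape, H.shape, round(rel_error, 3), topic_per_sample)
```

Each row of H weights the features (words) for one topic, and each row of W weights the topics for one sample.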

The procedure is carried out as follows:
- Extract nouns from the email messages. The topic is defined by the nouns in the message. (Note: this is slow, so we only use the val and test datasets.)
- Fit a TF-IDF vectorizer to the val dataset.
- Use this to get feature arrays for the val and test datasets.
- Fit the NMF model to the val dataset, for a given number of components (we use 4, but could experiment with more/less: this requires a scoring algorithm to decide the best).
- Use this to get an idea of what groups of emails cover.

In [None]:
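# warning, slow!

# NOTE: `nlp` is not defined earlier in this notebook. We assume here that
# extract_message_nouns expects a loaded spaCy pipeline; the model name below
# is an assumption, so change it to match your setup.
import spacy
nlp = spacy.load("en_core_web_sm")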

#df_msgs_train["nouns"] = extract_message_nouns(df_msgs_train["msg"], nlp)
#print("train done")
df_msgs_val["nouns"] = extract_message_nouns(df_msgs_val["msg"], nlp)
print("val done")
df_msgs_test["nouns"] = extract_message_nouns(df_msgs_test["msg"], nlp)
print("test done")

In [None]:
# number of topics to extract
n_topics = 4
# fit TF-IDF vectorizer to the val data (nouns are not extracted for the train data)
# use fewer features than before, as we only have nouns now
vec = TfidfVectorizer(max_features=5000, stop_words="english", max_df=0.95, min_df=2)
vec.fit(df_msgs_val["nouns"])

# list of unique words found by the vectorizer
# scikit-learn >= 1.0; older versions use vec.get_feature_names()
feature_names = vec.get_feature_names_out()

# extract features for each set
#features_train = vec.transform(df_msgs_train["nouns"])
features_val = vec.transform(df_msgs_val["nouns"])
features_test = vec.transform(df_msgs_test["nouns"])


plt.figure(figsize=(10, 25))

for j, beta_loss in enumerate(["kullback-leibler", "frobenius"]):
    # fit an NMF model to the val data
    cls = NMF(n_components=n_topics, beta_loss=beta_loss, solver='mu', random_state=0)
    cls.fit(features_val)

    # number of most influencing words to display per topic
    n_top_words = 10
    
    for i, topic_vec in enumerate(cls.components_):
        plt.subplot(5, 2, 2 * i + 1 + j)
        inds = topic_vec.argsort()[:-n_top_words - 1:-1]
        plt.barh(np.array(feature_names)[inds][::-1], topic_vec[inds][::-1])
        plt.title(f"{i} - {beta_loss}")
plt.tight_layout()
plt.show()

- 0: ECT: Enron Capital and Trade Resources
- 1: business (e.g. price, market, company)
- 2: communication based (mail, message etc.)
- 3: thanking you + scheduling

In [None]:
# note: cls is the last model fitted in the loop above (beta_loss="frobenius")
nmf_labels_test = cls.transform(features_test)
nmf_labels_val = cls.transform(features_val)

In [None]:
# see how many are in each class

# Mostly business related class
# Least in communication class

vc = pd.Series(nmf_labels_test.argmax(axis=1)).value_counts().to_frame().reset_index().sort_values("index")
vc.columns = ["cluster", "count"]
ind_map = {
    0: "ECT",
    1: "business",
    2: "communication",
    3: "scheduling"
}

vc["label"] = vc["cluster"].map(ind_map)
display(vc)

### Conclusions from the data

- The dataset can be used to provide insights into employees' email habits (e.g. rates, size of emails sent, when they are sent etc.). This assumes that the dataset is a representative sample of all emails sent by these users.
- The dataset requires cleaning before use (parsing the emails into metadata and body, dealing with missing/incorrect metadata etc.)
- An initial analysis of the messages shows clusters of topics, e.g. relating to scheduling meetings or to business information.

#### Further work
- The main improvement to this analysis would involve improved parsing of the email message bodies.
- The clustering used here could be used as the starting point to labelling the data:
    - label work and personal emails, to classify work emails for further analysis
    - label routine emails (e.g. automatic meeting reminders) and non-routine, so that routine emails can be removed from the dataset
- Reconstructing the email chain (i.e. find replies to emails) could be used to give insights into e.g. what types of messages elicit responses, what subjects are best etc.