# Network Analysis

Network analysis can reveal communication patterns between employees.
An email network represents each employee as a node and each email exchange as an edge,
so the resulting graph shows who communicates with whom.
When analyzing such a network, it is important to measure how nodes and edges are distributed.
One key measure is betweenness centrality, which indicates which
nodes are likely pathways for information and which employees may act as bridges that facilitate
communication about wrongdoing.
The Python package NetworkX supports the creation and analysis of complex networks.
Email networks can help
uncover key individuals, groups, and relationships.

- `nxviz`
- `G = nx.from_pandas_edgelist(data, 'sender', 'recipient1', edge_attr=['date', 'subject'])`
- `nxviz.ArcPlot`
- `nxviz.CircosPlot`
- `networkx.draw_networkx(G, networkx.spring_layout(G, k=0.1), node_size=25, node_color='red', with_labels=True, edge_color='blue')`
    - `k` (the spring tension in `spring_layout`) changes the visualization; a small `k` is more useful
- Degree Centrality
- Betweenness Centrality
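
As a minimal sketch of the notes above, the snippet below builds a graph from a small made-up edge list and ranks nodes by degree and betweenness centrality. The `data` frame and its names are assumptions for illustration only; the column names `sender` and `recipient1` follow the call noted above.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy data: two groups of three people joined only by carol <-> dave.
data = pd.DataFrame(
    {
        "sender": ["alice", "alice", "bob", "carol", "dave", "dave", "erin"],
        "recipient1": ["bob", "carol", "carol", "dave", "erin", "frank", "frank"],
        "subject": ["a", "b", "c", "d", "e", "f", "g"],
    }
)

# Each row becomes one undirected edge carrying the subject as an edge attribute.
G = nx.from_pandas_edgelist(
    data, source="sender", target="recipient1", edge_attr=["subject"]
)

degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)

# Nodes with high betweenness sit on many shortest paths between other nodes.
for node in sorted(G.nodes, key=betweenness.get, reverse=True):
    print(f"{node:6s} degree={degree[node]:.2f} betweenness={betweenness[node]:.2f}")
```

In this toy graph, `carol` and `dave` carry all traffic between the two groups, so they rank highest on betweenness even though no node has many more connections than the rest.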

[Kaggle - Enron Network Analysis](https://www.kaggle.com/code/jamestollefson/enron-network-analysis)

- Anomaly Detection, Social Network Analysis, Email Body Analysis

[Enron-Email-Analysis](https://github.com/mihir-m-gandhi/Enron-Email-Analysis)

[Network Analysis with the Enron Email Corpus](https://www.tandfonline.com/doi/pdf/10.1080/10691898.2015.11889734)

[Exploration of Communication Networks from the Enron Email Corpus](http://www.casos.cs.cmu.edu/publications/protected/2005-2006/diesner_2005_explorationsenron.pdf)


## Social Network Analysis

Refer to the documentation for the Python package `networkx` for information on network analysis in Python [1].
Reference [2] contains good examples of social network analysis with Python.

- Pre-Processing:
    - Load the dataset
    - Clean Data: Remove any emails with missing information, irrelevant emails, etc.
- Build Network
    - Directed Graph: e.g., sender and recipient represent the nodes, and each email is a directed edge from sender to recipient
    - Analyze Network Properties: Centrality bar charts (Refer to [2])
        - Degree Centrality: Identify individuals who sent or received the most emails.
        - Betweenness Centrality: Identify key individuals who serve as intermediaries.
        - Closeness Centrality: Find individuals who are closest to others on average.
        - Eigenvector Centrality: Help identify key figures who are well-connected to other influential individuals, providing deeper insight into the power dynamics within the Enron network.
            - High Eigenvector Centrality: Individuals with high scores are not only well-connected but connected to other influential people in the network. 
            These could represent core members of critical communication networks or leaders in the organization.
            - Low Eigenvector Centrality: Individuals with low scores are likely either peripheral or isolated in the network or are only connected to others with similarly low influence.
- Additional Network Ideas:
    - Detect Communities: Community detection algorithms (e.g., Girvan-Newman) to find groups within the network.
    - Analyze Clusters: Investigate clusters within the network to understand isolated teams or departments.
- Visualize Network:
    - Color-code communities or size nodes by centrality for extra detail
- Reference NetworkX: [bibtex](http://conference.scipy.org.s3-website-us-east-1.amazonaws.com/proceedings/scipy2008/paper_2/reference.bib)
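
The directed-graph and centrality steps in the outline above could be sketched as follows. The sender and recipient names are invented, and storing the number of emails per pair in a `weight` attribute is one possible design choice rather than the method used in [2].

```python
import networkx as nx

# Hypothetical (sender, recipient) pairs; one tuple per email.
emails = [
    ("ann", "bob"), ("ann", "bob"), ("ann", "cat"),
    ("bob", "cat"), ("cat", "dan"), ("dan", "ann"),
    ("eve", "cat"),
]

# Directed graph: an edge points from sender to recipient.
G = nx.DiGraph()
for sender, recipient in emails:
    if G.has_edge(sender, recipient):
        G[sender][recipient]["weight"] += 1  # repeated email between the same pair
    else:
        G.add_edge(sender, recipient, weight=1)

in_degree = nx.in_degree_centrality(G)      # share of people who emailed this node
out_degree = nx.out_degree_centrality(G)    # share of people this node emailed
betweenness = nx.betweenness_centrality(G)  # how often a node lies on shortest paths
closeness = nx.closeness_centrality(G)      # based on incoming distances for DiGraphs
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)  # based on in-edges

for name in sorted(G.nodes, key=in_degree.get, reverse=True):
    print(
        f"{name}: in={in_degree[name]:.2f} out={out_degree[name]:.2f} "
        f"btw={betweenness[name]:.2f} close={closeness[name]:.2f} "
        f"eig={eigenvector[name]:.2f}"
    )
```

Here `eve` only sends email and never receives any, so her eigenvector score stays near zero, which matches the low-score description in the outline.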

[1](https://networkx.org/documentation/stable/index.html)
[2](https://link.springer.com/book/10.1007/978-3-319-53004-8)
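
Community detection with Girvan-Newman might look like the sketch below, run on a hypothetical graph of two three-person teams joined by one bridge. Only the first split produced by the algorithm is used; real email data would likely need more splits or a different algorithm.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Hypothetical network: two tightly connected teams linked by a single edge.
G = nx.Graph()
G.add_edges_from(
    [
        ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),  # team A
        ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),  # team B
        ("a3", "b1"),                              # bridge between the teams
    ]
)

# girvan_newman yields successively finer partitions; take the first split.
first_split = next(girvan_newman(G))
communities = sorted(sorted(members) for members in first_split)
print(communities)  # [['a1', 'a2', 'a3'], ['b1', 'b2', 'b3']]

# Map each node to a community index, e.g. to use as a node color when drawing.
community_of = {
    node: index for index, members in enumerate(communities) for node in members
}
node_colors = [community_of[node] for node in G.nodes]
```

The `node_colors` list is in the same order as `G.nodes`, so it can be passed as `node_color` to `nx.draw_networkx` to color-code the communities.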

## Import Modules

In [None]:
import numpy as np
import os
import pandas as pd
import sqlite3
import re
import json
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import nxviz as nv

from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Get the full path to the root directory
# os.path.dirname(os.getcwd()) is root dir - INTA6450_Enron/ folder
root_dir = os.path.dirname(os.getcwd())
print(f"Root directory: {root_dir}")
# Current working directory
cwd = os.getcwd()
print(f"Current working directory: {cwd}")

## Load Database

In [None]:
# INTA6450_Enron/data/emails.db
path_db = f"{root_dir}/data/emails.db"

# Table name for DataFrame saved in the database
table_name = "emails"

# Connect to the database (or create it if it doesn't exist)
connection = sqlite3.connect(path_db)

# Create a cursor object to execute SQL commands
cursor = connection.cursor()

# Load the dataframe from the SQLite database
emails_df = pd.read_sql_query(f"SELECT * FROM {table_name}", connection)

# Close the connection
connection.close()

# Show email data
emails_df.head()
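
The load step above can also be wrapped so that the connection is closed even if the query raises an error. The sketch below assumes an in-memory database and a made-up two-row `emails` table so that it runs on its own. It uses `contextlib.closing` because a `sqlite3` connection used directly in a `with` statement manages transactions but does not close the connection.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical stand-in for data/emails.db: an in-memory database with two rows.
sample = pd.DataFrame(
    {"folder": ["inbox", "sent"], "text": ["hello", "goodbye"]}
)

with closing(sqlite3.connect(":memory:")) as connection:
    # Write the sample table, then read it back the same way the notebook does.
    sample.to_sql("emails", connection, index=False)
    loaded = pd.read_sql_query("SELECT * FROM emails", connection)

# The connection is closed here, but the DataFrame remains available.
print(loaded)
```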

# Examine Data

Look at the first email in the DataFrame.

In [None]:
# Get the first email text
text = emails_df.iloc[0]["text"]

# Remove newline characters
text_newline = re.sub(r"\n", " ", text)

# Prepare the data dictionary for JSON
data = {"text": text, "text_clean": text_newline}

# Save to text.json
# file_path = f"{cwd}/text.json"
# with open(file_path, "w") as file:
#     json.dump(data, file)

# Get the text
text = data["text"]
clean = data["text_clean"]

# Print the formatted text
print(text)

# Unique Folder Count

Number of emails in the database that came from each folder.

## Filter Data

In [None]:
df = emails_df.copy()

print("number of emails: ", df.shape[0])
print("number of unique folders: ", df["folder"].unique().shape[0])

unique_emails = pd.DataFrame(df["folder"].value_counts())
unique_emails.reset_index(inplace=True)


unique_emails.columns = ["folder_name", "count"]
unique_emails.head(10)

## Plot Top 20 Folders

In [None]:
plt.figure(figsize=(10, 6))

# Grid
# plt.grid(zorder=1)

# Top 20 folders
data_filter = unique_emails.iloc[:20, :]

# Color palette
palette = sns.color_palette("hls", len(data_filter))

ax = sns.barplot(
    x="count", y="folder_name", data=data_filter, 
    palette=palette, zorder=2
)

plt.title(f"TOP {len(data_filter)} FOLDERS", fontsize=14, fontweight="bold")
plt.xlabel("Number of Emails")
plt.ylabel("Email Folder Name")

# Adjust font size of y-axis tick labels
ax.tick_params(axis="y", labelsize=8)

# Adding text labels
for container in ax.containers:
    ax.bar_label(container, fmt = "%.0f", label_type="edge", padding=-25, 
                 fontsize = 8, fontweight="bold", zorder = 3)

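# Save the plot before showing it (create the output folder if it is missing)
figures_dir = f"{root_dir}/data/figures"
os.makedirs(figures_dir, exist_ok=True)
plt.savefig(f"{figures_dir}/top_20_folders_email_count.png", format="png", dpi=300, bbox_inches="tight")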

# Show plot
plt.show()

# Emails Sent

## Filter Data

- Cleaned up the majority of email addresses that contained apostrophes, `<`, `>`, or duplicate periods.

In [None]:
df = emails_df.copy()

# Fix broken emails
# Remove apostrophes from 'from' and 'x-from' columns
# df["from"] = df["from"].str.replace("'", "", regex=False)
# # Remove " <" before the email
# df['from'] = df['from'].str.replace(r'\s<', '', regex=True) 
# # Remove ">" after the email
# df['from'] = df['from'].str.replace(r'>', '', regex=True)
# # Replace double periods with a single period
# df['from'] = df['from'].str.replace(r'\.\.', '.', regex=True)



def clean_email_address(email):
    """Clean and fix broken email addresses."""
    if not isinstance(email, str):
        return email  # Return the value unchanged if it's not a string
    
    # Remove apostrophes
    email = re.sub(r"'", "", email)
    # Remove " <" before the email
    email = re.sub(r'\s<', '', email)
    # Remove ">" after the email
    email = re.sub(r'>', '', email)
    # Replace double periods with a single period
    email = re.sub(r'\.\.', '.', email)
    
    return email

# Apply the function to the DataFrame
df['from'] = df['from'].apply(clean_email_address)


# TODO: Write to db

# Show email data
df.head()
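
A quick check on made-up addresses shows what the cleaning steps do. The function is repeated from the cell above so that the sketch runs on its own; the sample strings are assumptions, not values taken from the dataset.

```python
import re


def clean_email_address(email):
    """Mirror of the cleaning steps in the cell above."""
    if not isinstance(email, str):
        return email  # leave missing or non-string values unchanged
    email = re.sub(r"'", "", email)      # remove apostrophes
    email = re.sub(r"\s<", "", email)    # remove " <" before the address
    email = re.sub(r">", "", email)      # remove ">" after the address
    email = re.sub(r"\.\.", ".", email)  # replace double periods
    return email


# Hypothetical inputs covering each case the function handles.
samples = ["'jane.doe@enron.com'", " <jane.doe@enron.com>", "jane..doe@enron.com", None]
cleaned = [clean_email_address(s) for s in samples]
print(cleaned)

# A display name in front of the angle bracket is kept and joined to the address.
with_name = clean_email_address("Jane Doe <jane.doe@enron.com>")
print(with_name)
```

The last case suggests why only the majority of addresses are fixed: the function removes the brackets but not a display name before them, so such values would need an extra extraction step.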

In [None]:
df["text"].iloc[0]

In [None]:
def clean_text(text):
    """Clean the email text by removing unwanted characters and formatting."""
    if not isinstance(text, str):
        return text  # Leave missing or non-string values unchanged
    # Remove any non-alphanumeric characters except whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse newlines and repeated whitespace into single spaces
    # (done last so that removed characters do not leave double spaces)
    text = re.sub(r'\s+', ' ', text)
    return text

# Apply the clean_text function to the 'text' column
clean = df['text'].apply(clean_text)

# Display the cleaned text of the first email
# df['clean_text'].iloc[0]
clean[0]

In [None]:
clean[4]

## Invalid Email Addresses

- Invalid email addresses were identified as those that do not match the `first.last@email` pattern
- Only 5 invalid addresses, which sent 5 emails in total, were found among roughly 250K emails, so their impact is small
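
One way such addresses might be flagged is sketched below. The regular expression is an assumed reading of the `first.last@email` pattern, and the sample series is made up; neither is taken from the original analysis.

```python
import pandas as pd

# Assumed pattern: at least one period in the local part, then "@" and a dotted domain.
VALID_PATTERN = r"^[\w-]+\.[\w.-]+@[\w-]+(?:\.[\w-]+)+$"

# Hypothetical sender column with two malformed entries.
senders = pd.Series(
    [
        "jane.doe@enron.com",
        "jane.doe@enron.com",
        "no.address@",
        "bob.smith@enron.com",
        "jdoe",
    ]
)

# na=False treats missing values as invalid instead of propagating NaN.
is_valid = senders.str.match(VALID_PATTERN, na=False)

# Count how many emails each invalid address sent.
invalid_counts = senders[~is_valid].value_counts()
print(invalid_counts)
print("invalid addresses:", len(invalid_counts), "emails:", int(invalid_counts.sum()))
```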

## Most Sent Emails

- Filter the data to find the addresses that sent the most emails in the database

In [None]:
# Create a DataFrame that counts the emails sent from each 'from' address
df_senders = df.groupby(["from"]).size().reset_index(name="Counts")

# Sort the DataFrame by 'Counts' in descending order
df_senders = df_senders.sort_values(by="Counts", ascending=False)

# Reset the index to be sequential from 0 to N
df_senders.reset_index(drop=True, inplace=True)

# Rename the columns in df_senders
df_senders = df_senders.rename(columns={"from": "Sender Address"})

# Display the resulting DataFrame
df_senders.head(20)

## Search Email Data

In [None]:
# Search for rows in the 'Sender Address' column that contain "piassick"
df_senders_filtered = df_senders[
    df_senders["Sender Address"].str.contains(r"piassick", regex=True, na=False)
]
df_senders_filtered.reset_index(drop=True, inplace=True)
# Display the filtered rows
df_senders_filtered.head(20)

## Email Meta Data

In [None]:
# Calculate total number of emails and total newline count
total_emails = len(emails_df)
total_newlines = emails_df["text"].str.count("\n").sum()

# Count email addresses in each email body and calculate total email addresses
emails_df["email_count"] = emails_df["text"].str.count(r"\S+@\S+")
total_email_addresses = emails_df["email_count"].sum()

# Count URLs in each email body
emails_df["url_count"] = emails_df["text"].str.count(r"http\S+|www\S+")
total_urls = emails_df["url_count"].sum()

# Count special characters and punctuation in each email body
emails_df["special_char_count"] = emails_df["text"].str.count(r"[^a-zA-Z\s]")
total_special_chars = emails_df["special_char_count"].sum()

# Count words in each email body
emails_df["word_count"] = emails_df["text"].str.findall(r"\b\w+\b").str.len()
total_words = emails_df["word_count"].sum()

# Count sentences in each email body (heuristic based on sentence-ending punctuation)
emails_df["sentence_count"] = emails_df["text"].str.count(r"[.!?]")
total_sentences = emails_df["sentence_count"].sum()

# Create the new DataFrame with newline counts
newline_counts_df = emails_df[["text"]].copy()

# Add counts into df
newline_counts_df["New Lines \\n"] = emails_df["text"].str.count("\n")
newline_counts_df["Email Count"] = emails_df["email_count"]
newline_counts_df["URL Count"] = emails_df["url_count"]
newline_counts_df["Special Char Count"] = emails_df["special_char_count"]
newline_counts_df["Word Count"] = emails_df["word_count"]
newline_counts_df["Sentence Count"] = emails_df["sentence_count"]

# Create a new row for the totals and insert it at the top of the DataFrame
totals_row = pd.DataFrame(
    {
        "text": ["Total Emails"],
        "New Lines \\n": [total_newlines],
        "Email Count": [total_email_addresses],
        "URL Count": [total_urls],
        "Special Char Count": [total_special_chars],
        "Word Count": [total_words],
        "Sentence Count": [total_sentences],
    }
)
newline_counts_df = pd.concat([totals_row, newline_counts_df], ignore_index=True)

# Add total emails count as the first row
# newline_counts_df.at[0, "text"] = f"Total Emails: {total_emails}"

newline_counts_df.head()  # Display the first few rows to verify
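
To see how these regex heuristics behave, the sketch below applies the same patterns to a single made-up message. The sample text is an assumption chosen to include a URL and an email address.

```python
import pandas as pd

# Hypothetical message body with one URL and one email address.
sample = "Hi Tim,\nSee www.enron.com or mail jane.doe@enron.com.\nThanks!"
body = pd.Series([sample])

counts = {
    "newlines": int(body.str.count("\n").iloc[0]),
    "emails": int(body.str.count(r"\S+@\S+").iloc[0]),
    "urls": int(body.str.count(r"http\S+|www\S+").iloc[0]),
    "words": int(body.str.findall(r"\b\w+\b").str.len().iloc[0]),
    "sentence_marks": int(body.str.count(r"[.!?]").iloc[0]),
}
print(counts)
```

The sentence heuristic counts every `.`, `!`, and `?`, so periods inside URLs and email addresses inflate the total: this short sample yields 6 sentence marks. The word pattern also splits addresses and URLs into several words.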

In [None]:
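tim_sent_emails_df = emails_df[
    # regex=False so the periods in the address are matched literally
    emails_df["from"].str.contains("tim.belden@enron.com", regex=False, na=False)
].copy()  # Copy so the assignments below do not modify a slice of emails_df

# One row per recipient, with surrounding whitespace stripped
tim_sent_emails_df['to'] = tim_sent_emails_df['to'].str.split(',')
tim_sent_emails_df = tim_sent_emails_df.explode('to').reset_index(drop=True)
tim_sent_emails_df['to'] = tim_sent_emails_df['to'].str.strip()
tim_sent_emails_df = tim_sent_emails_df.dropna(subset=['to'])
tim_sent_emails_df = tim_sent_emails_df.drop_duplicates(subset=['to']).reset_index(drop=True)

# Create graph with one edge from the sender to each unique recipient
G = nx.from_pandas_edgelist(tim_sent_emails_df, 'from', 'to', edge_attr=['date', 'subject'])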

# Plot using ArcPlot
plot = nv.ArcPlot(G)
plt.figure(figsize=(20, 20))

# Use a different layout to reduce overlap
pos = nx.spring_layout(G, k=0.3, iterations=50)

# Draw the graph
nx.draw_networkx(
    G, pos, node_size=50, node_color='red', with_labels=True, edge_color='blue', font_size=8
)
plt.title("Tim Belden Sent Emails")

# Show plot
plt.show()
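
Dropping duplicate recipients, as in the cell above, discards how often each person was emailed. One possible alternative, sketched here on made-up rows shaped like `emails_df`, keeps that information as an edge `weight` in a directed graph. The addresses and the `weight` attribute name are assumptions.

```python
import networkx as nx
import pandas as pd

# Hypothetical rows with the same "from" and "to" columns as emails_df.
sample = pd.DataFrame(
    {
        "from": ["tim@x.com", "tim@x.com", "tim@x.com"],
        "to": ["ann@x.com, bob@x.com", "ann@x.com", None],
    }
)

# One row per (sender, recipient) pair, then count emails per pair.
edges = (
    sample.dropna(subset=["to"])
    .assign(to=lambda d: d["to"].str.split(","))
    .explode("to")
    .assign(to=lambda d: d["to"].str.strip())
    .groupby(["from", "to"])
    .size()
    .reset_index(name="weight")
)

# Directed edges run from sender to recipient and carry the email count.
G = nx.from_pandas_edgelist(
    edges, source="from", target="to", edge_attr="weight", create_using=nx.DiGraph
)

for sender, recipient, weight in G.edges(data="weight"):
    print(f"{sender} -> {recipient}: {weight} email(s)")
```

The weights could then drive edge widths in `nx.draw_networkx` or be passed to weighted centrality measures.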

In [None]:
tim_received_emails_df = emails_df[
    # regex=False so the periods in the address are matched literally
    emails_df["to"].str.contains("tim.belden@enron.com", regex=False, na=False)
].copy()  # Copy so the assignment below does not modify a slice of emails_df

# Keep one row per sender and point every edge at Tim Belden.
# Deduplicating on 'to' as well would drop senders whose first listed recipient repeats.
tim_received_emails_df = tim_received_emails_df.drop_duplicates(subset=['from']).reset_index(drop=True)
tim_received_emails_df['to'] = 'tim.belden@enron.com'

# Create graph with one edge from each unique sender to Tim Belden
G = nx.from_pandas_edgelist(tim_received_emails_df, 'from', 'to', edge_attr=['date', 'subject'])

# Plot using ArcPlot
plot = nv.ArcPlot(G)
plt.figure(figsize=(20, 20))

# Use a different layout to reduce overlap
pos = nx.spring_layout(G, k=0.3, iterations=50)

# Draw the graph
nx.draw_networkx(
    G, pos, node_size=50, node_color='red', with_labels=True, edge_color='blue', font_size=8
)
plt.title("Tim Belden Received Emails")

# Show plot
plt.show()

In [None]:
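tim_emails_recipients_df = emails_df[
    # regex=False so the periods in the address are matched literally
    emails_df["from"].str.contains("tim.belden@enron.com", regex=False, na=False)
].copy()  # Copy so the assignments below do not modify a slice of emails_df

# One row per recipient, with surrounding whitespace stripped
tim_emails_recipients_df['to'] = tim_emails_recipients_df['to'].str.split(',')
tim_emails_recipients_df = tim_emails_recipients_df.explode('to').reset_index(drop=True)
tim_emails_recipients_df['to'] = tim_emails_recipients_df['to'].str.strip()

# Count emails per recipient and keep the top 30 as columns 'to' and 'count'
tim_emails_recipients_df = (
    tim_emails_recipients_df['to'].value_counts().rename_axis('to').reset_index(name='count').iloc[:30, :]
)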

#plt.figure(figsize=(20, 40))

# Color palette
palette = sns.color_palette("hls", len(tim_emails_recipients_df))

ax = sns.barplot(
    x="count", y='to', data=tim_emails_recipients_df, 
    palette=palette, zorder=2
)

plt.title("EMAIL RECIPIENTS OF TIM BELDEN", fontsize=14, fontweight="bold")
plt.xlabel("Number of Emails")
plt.ylabel("Recipient Address")

# Adjust font size of y-axis tick labels
ax.tick_params(axis="y", labelsize=8)

# Adding text labels
for container in ax.containers:
    ax.bar_label(container, fmt = "%.0f", label_type="edge", padding=-25, 
                 fontsize = 8, fontweight="bold", zorder = 3)


In [None]:
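# Emails with at least one recipient address containing "portland"
Portland_email_senders_df = emails_df[
    emails_df["to"].str.contains(r"portland", regex=True, na=False)
].copy()  # Copy so the assignments below do not modify a slice of emails_df

# One row per recipient, with surrounding whitespace stripped
Portland_email_senders_df['to'] = Portland_email_senders_df['to'].str.split(',')
Portland_email_senders_df = Portland_email_senders_df.explode('to').reset_index(drop=True)
Portland_email_senders_df['to'] = Portland_email_senders_df['to'].str.strip()

# Keep only the individual (non-Portland) addresses on those emails
Portland_email_senders_df = Portland_email_senders_df[
    ~Portland_email_senders_df["to"].str.contains(r"portland", regex=True, na=False)
]

# Count emails per address and keep the top 30 as columns 'to' and 'count'
Portland_email_senders_df = (
    Portland_email_senders_df['to'].value_counts().rename_axis('to').reset_index(name='count').iloc[:30, :]
)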

#plt.figure(figsize=(20, 40))

# Color palette
palette = sns.color_palette("hls", len(Portland_email_senders_df))

ax = sns.barplot(
    x="count", y='to', data=Portland_email_senders_df, 
    palette=palette, zorder=2
)

plt.title("INDIVIDUAL EMAILS TO PORTLAND OFFICE LISTSERVS", fontsize=14, fontweight="bold")
plt.xlabel("Number of Emails")
plt.ylabel("Recipient Address")

# Adjust font size of y-axis tick labels
ax.tick_params(axis="y", labelsize=8)

# Adding text labels
for container in ax.containers:
    ax.bar_label(container, fmt = "%.0f", label_type="edge", padding=-25, 
                 fontsize = 8, fontweight="bold", zorder = 3)